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FOREWORD 


This work started as a wild goose chase for evidence beyond any doubt that supernova data 
show cosmic acceleration. Through a study involving artificial neural networks 1 , trying to find 
parametrisation free constraints on the expansion history of the universe, we ran into trouble 
that led us all the way to reconsider the standard method. We encountered the same problem 
that has been lurking in many previous studies, only in an uncommon disguise. Solving this 
problem for the neural networks degenerated into solving the original problem. Having done 
that, it turned out that this result in itself is interesting. This thesis is more or less laying out 
the article [1] in all gory details. 

The level of rigour throughout is kept, I think, sufficient but not over the top — particularly 
the chapter on statistics suffers at the hands of a physicist. I have tried to keep unnecessary 
details out of the way in favor of results and physical insight. I leave the details to be filled in 
by smarter people — well, more interested people. 

A bunch of humble thanks to everyone who helped me at the institute, including — but 
presumably not limited to — Laure, Andy, Chris, Anne Mette, Sebastian, Assaf, Morten, Jenny, 
Christian, Tristan, and the rest of the Academy and high energy groups, and of course Helle 
and Anette without whom our building would crumble. 

Thanks to Alberto and Subir for company and supervision on this trip through cosmology 
and data analysis. Obviously I extend my gratitude to Subir for teaching me the most valuable 
lesson leading to this work: don't believe any analysis you can't understand and, if time permits, 
carry out the analysis yourself. The following is my attempt at understanding the analysis of 
supernova data. 


1 This subject is interesting in its own right, but I will not have the space to go into any detail about it. 




ABSTRACT 


The cosmological standard model at present is widely accepted as containing mainly things we 
do not understand. In particular the appearance of a Cosmological Constant, or dark energy, 
is puzzling. This was first inferred from the Hubble diagram of a low number of Type la 
supernovae, and later corroborated by complementary cosmological probes. 

Today, a much larger collection of supernovae is available, and here I perform a rigorous 
statistical analysis of this dataset. Taking into account how the supernovae are calibrated to be 
standard candles, we run into some subtleties in the analysis. To our surprise, this new dataset 
— about an order of bigger than the size of the original dataset — shows, under standard 
assumptions, only mild evidence of an accelerated universe. 
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INTRODUCTION 


The present standard model of cosmology explains quite well a host of observations. The 
inclusion of a cosmological constant in Einstein's equations combined with the assumed 
homogeneous and isotropic Friedmann-Robertson-Walker metric description of spacetime gives 
us the hailed ACDM model. A for the inferred cosmological constant, more popularly known 
as dark energy, and CDM is the cold dark matter. Dark because we can't see it, and cold because 
apparently it behaves like non-relativistic particles — compared to (almost) massless particles, 
like neutrinos, which are hot. The baryonic matter 1 is a minor component of the content of the 
universe. 

The usual starting point of the history of modern cosmology is the two groups studying 
supernovae at the end of the nineties, [2, 3]. With observations of very far-away supernovae, the 
two teams independently claimed that the Hubble expansion rate is accelerating and inferred 
from that a best-fit universe with a cosmological constant density parameter around 0 . 7 . These 
results followed a massive experimental effort to find, classify, and calibrate the supernovae. 

The big bang picture of the universe had emerged long before then. From extrapolating 
the expansion of the universe back in time, it was realised that in the past, the universe will 
have been much denser and much hotter. Two consequences of this is the cosmic microwave 
background (CMB) and a particular abundance of light elements, in particular 4 He, in the 
early universe — which is of course altered during the history of the universe. Both these 
phenomenas are observable today, 2 and confirm to a high degree this picture of a hot plasma 
filling the universe. Since Penzias and Wilson first saw a glimpse of the cosmic radiation, 
many experiments have come to the same conclusion. The three latest spaceborne missions, 
COBE, WMAP, and Planck, have, one after the other, measured to unprecedented precision the 
spectrum, and lately there has been a spur of interest in detecting gravitational waves in the 
hopes of information about the inflationary stage — even before the hot plasma! 

Since mid-2000, another probe has also come into light. Baryon accoustic oscillations 
(BAO) are the remnant effects of soundwaves in the primeval plasma, which are supposed to 
enhance the matter correlation function at a particular scale — even in the late universe. Other 
constraints on the model come from more sides than I can hope to do justice here. Large scale 
structure surveys, gravitational lensing surveys etc., all help to constrain parameters of the 
model. Supernova observations have since the late nineties been one of the major players in 
cosmology. They, along with BAO and CMB observations are now the three major pillars of any 
analysis — an analysis of one will usually include the constraints of the others when quoting 
final results. Amazingly, these three observables apparently agree that the universe is indeed 
mostly cosmological constant and cold dark matter. 

In the following I focus on the analysis of supernovae, in particular by performing a 
maximum likelihood analysis to put constraints on the cosmological model parameters. On 
the way, we will look at some of the problems of the standard model of cosmology and the 
standard treatment of the supernova data. I hope to have made the whole thing reasonably self 
contained. 

I first present all the needed statistical tools in Chap. 2. This is followed by a description of 
the cosmology we will look at in Chap. 3 and the observations of supernovae in Chap. 4. Finally 
a presentation of the main analysis and result is in Chap. 3 and some concluding remarks in 
Chap. 6. 


1 This includes all particles of the standard model of particle physics, not just baryons. 

2 Don't mention the lithium problem! [4] 
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Statistics is an old, well studied subject, from which physicists take that everything is distributed 
as gaussians and counting experiments have Poisson statistics. In the present section I hope 
to clarify why this is the case, and to which extent it is true. The main approach will be what 
is now known as frequentist, but Bayesian statistics will also be described briefly. For a vivid 
discussion of the differences between the two, see eg. [5]. 

2.1 PROBABILITIES 

I will start with the basics. We write the probability for some event, call it A, to happen P(A). 
One immediate statement is that the universe is unitary, which is to say that something must 
happen, so the sum of all probabilities must be one: Ya ^(A) = 1 . If the outcome A is dependent 
on some other observation B, we write the probability of A to happen, given B as P(A\B). This 
quantity is in general different from P(A). We can connect the two through summing over the 
possible outcomes of the event B, 

P(A)=£P(A|B)P(B) (2.1.1) 

B 

We may also consider the joint probability of both events A and B to happen, P(AB). We may 
now expand this as the probability of just one of the events happening times the probability of 
the other happening — given the other. In equations, 

P{AB) = P(A\B)P(B) = P(B\A)P(A) (2.1.2) 

The second step follows from the symmetry of A and B. The second equality is known as Bayes' 
theorem. This is what underlies Bayesian statistics — but it is certainly true whether one is 
Bayesian or not. 

If we wish to describe outcomes which are not discrete (like heads or tails) but rather 
continuous, we want to consider instead of just probabilities, a probability density function (pdf). 
To motivate this, consider an infinite number of possible outcomes of an experiment. Then the 
probability for any individual outcome in general vanishes. This is what the pdf sorts out for 
us. Say A is a real number we are trying to predict. Then the pdf f(A) is defined to fulfil 

/ Amax 

f(A)dA (2.1.3) 

^min 

This definition is trivially extended to multiple dimensions by simply extending A and general¬ 
ising the interval. We may write, generally 

P(A e n) = [ f(A)dA, (2.1.4) 

in 

where O is some volume in the space of possible As. As before, the integral over all possible 
outcomes must be 1 by unitarity. We note that by putting in delta functions in the above pdfs, 
we can go back to the discrete picture. Say there are only discrete outcomes A, of A with 
probabilities p,, respectively. I can then write the pdf as 

fi A ) = A i)Pi (■ 2 - 1 - 5 ) 
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What shall interest us most here are continuous distributions, ie. pdfs. The Eqs. (2.i.i)-(2.i.2) 
extend to 

f(A) = J f(A\B)f(B) dB (2.1.6) 

f(A\B)f(B) =f(B\A)f(A) (2.1.7) 

Note the abuse of notation that / may vary according to the argument. If nothing else is explicit, 
it is simply to be understood as the pdf of the argument. 

2.2 EXPECTATIONS 

To any pdf f(A), where A may generally describe a set of multiple parameters, A — {a\, «2 • • • ct n }, 
we define the expectation value 1 of a quantity B(A) as 

(B) = j f(A)B dA (2.2.1) 

Special cases of this are the average p — (A) and variance <J~ — (A 2 — (A) 2 ) of a distribution. 
For some distributions these integrals may not converge, in which case extra care has to be 
taken. A particular, not immediately interesting, average is the following function of k, 

f(k) = (e ikA ) = J f(A)e ikA dA , (2.2.2) 

called the characteristic function. Obviously, this is just the fourier transform of the pdf 2 . The 
significance of this particular function becomes evident when considering sums of random 
variables. Take the sum of the independent random variables {X;}. The characteristic function 
of this is the expectation value of exp ik X, = exp ikY. Writing the exponential in two different 
ways, we see that the characteristic function of the sum is just the product of the characteristic 
functions of the summands, 

f Y {k) = (exp ikY) = n( ex P ikx i) = Tlfa, ( k ) ( 2 - 2 - 3 ) 

i i 

Let's see how this works in practice by some examples. 

The x 2 distribution Consider v independent random variables, all drawn from normal 
distributions. We denote this as 3 


x f ~ j\r (0,1) 

/x(X;) = ( 27 r)" 1 / 2 exp( —Xf/2), (2.2.4) 

We are now interested in the pdf f 2 ofY — Jfl X 2 , called the x 2 distribution with v degrees 
of freedom. We zvill use that we know how to go back again from the characteristic function, 
simply by an inverse fourier transform. First writing down the characteristic function, I 
denote Z, = X 2 , 


h 2 ( fc ) = / n^/z(z.o dz *=n/zw ( 2 - 2 -5) 

J i i 

Note that the expectation value is not necessarily what we expect. Indeed we may have the situation that /((A)) = 0, ie. 
we have no chance of obtaining the expected value! For this reason, one commonly uses average and mean to mean the 
same thing. The most expected value, ie. the value with the highest probability density is called the mode. 

Up to a constant in front of the integral, depending on your convention. 

Seeing X as a vector, I will write X ~ =t- /x(X) = exp(—X r Z _1 X/2) to denote a multivariate 

normal distribution. 



2.2 EXPECTATIONS 


since Y is the sum of the Z,-s, the characteristic function is just the product of the characteristic 
functions of the summands. Nozv we need first the characteristic function for the square of a 
single normally distributed variable 4 . We find for the pdf of Z, 

/ X2 (Z) = J f x (X)S(Z - X 2 )dX = J /x(X) ^~ X j,+ ^ + X) dX 

= ( 2 Z/r)~ 1/2 exp(—Z/ 2 ), Z >0 (2.2.6) 

Where the second equality follows from the identity, 

5 {x - Xi ) 


<%(*)) = E 




(2.2.7) 


where the x ; - are the roots of g. The proof of Eq. (2.2 .y) follows by a change of variables in the 
integral . 5 The characteristic function is then 


f x2 {k ) = Je ikz f x 2 (Z) dZ = (2 ;t) -1/2 J Z ~ 1 / 2 e z ( ik -' l/2 ) 
= (2 ti)- 1/2 [ e^ k -V x2/2 dX = , 1 

v ’ J y /1 - 2 ik 


dZ 


(2.2.8) 


From Eq. (2.2.3) we now see ty multiplication and taking the inverse fourier transform that 


LA) = 


fAr) = fuk 


exp (—ikY) 


In J ""(1-2 ik ) v / 2 


'^ vv (1 - 2iky/ 2 ^ 

This last one is a tricky integral. Anticipating the correct answer, I rewrite it as 


/YN 

V 1 

i z — r 

U, 

' 2 m' J 00 


a Y/2-ikY 


—iY dk 


(Y /2 - ikY ) v / 2 


(2.2.9) 


(2.2.10) 


Here 7 have simply pulled some functions ofY outside the integral and the inverse inside the 
integral. Changing variables to s — —ikY + Y/ 2 , we get 


2 exp(—Y/2) 2 


Y\ 2 ” 1 1 


ico+Y/2 
2 m J— ico+Y/2 


"' 2 ds 


(2.2.11) 


To solve this last integral, we are inspired by how it looks like an inverse Laplace transform, 
[ 6 ], Consider first the integral representation of the T function, which can be moulded to 
look like a Laplace transform by a change of variables, 


POO P 00 

T(z) = / t z ~ 1 e~ t dt = / ( su) z ~ 1 e~ su s du 
Jo Jo 

r(z) 


= / u z ~ L e~ su du = £(ir _i ) 
Jo 


(2.2.12) 

(2.2.13) 


We now invert this and find u z 1 as the inverse Laplace transform of the left hand side, 
^ rzoo+A 


i; 2 " 1 = 


2ni J— zoo+A 


e SM ^ ds 


pioo+A 


1 1 /-zoo+A 1 

=> =7=r —t / e su (su)~ z u ds — -—7 / e s s~ z ds 

r(z) 2 m J-joo+A 2 m J-ico+X 

It is nozv evident from inserting z = v /2 tznd A = Y/ 2 , that zvegetfor Eq. (2.2.9) 


(2.2.14) 


A* 00 - 


2T(v/2) 


. v 1 

y\ 2 _i 


exp(-Y/ 2 ) 


(2.2.15) 


Which is the f 2 distribution with 1 degree of freedom. 

Remember the <5 function only formally makes sense inside an integral. 














6 


STATISTICS 


The y 2 distribution is widely used in statistical analysis, and we shall see why later on. 

Another application of characteristic functions is a derivation of the central limit theorem, 
which goes as follows. 

The central limit theorem This theorem states that asymptotically, the sum of many 
random variables will converge to a normal distribution — almost irrespective of the original 
distributions! We will again use the fact that the characteristic function of a sum is the 
product of characteristic functions. Define Y — YU X;/vN, where the X , are independently, 
identically distributed (iid.) variables, 

/(X 1 ) = -..=/(X N ) (2.2.16) 

We are now interested in /y in the limit N —» 00. Assume first that f has a well defined 
variance a 2 and zero mean p = 0 . 6 Now expand the characteristic function to second order 
in k and write 

Mk) =f(k/VN)» =ny“«^> = (1 - ^+° (^)) N 

+ o(Th)) (2.2.17) 

Nozv we calculate the characteristic function of a general normal distribution, 

/ W =VM) = ^e*p(h^) 

=► f( k ) = j dxe ' kx fiffiffffi ex P ( ^2 ) = exp ( iak (2.2.18) 

Comparing Eqs. (2.2.17) an d (2.2.18), we see that the tivo match if we identify 

p Y = 0 (2.2.19) 

cr 2 = o 2 (2.2.20) 

Thus the distribution of a sum of many iid. random variables converges to a normal 
distribution. This underlies many assumptions made in statistical treatments of errors and 
uncertainties. 


A closely related concept to the characteristic function is the moment generating function. This 
is constructed by simply taking k imaginary in the characteristic function. 


M(k) — J f{x)e xk dx = ( e xk ) = f(—ik) (2.2.21) 

The nice property of this function is that we can, as the name suggests, generate the moments, 
( x n ) of a distribution. Having all the moments of a distribution defines it uniquely 7 . To generate 
the moments, we do the following. 


(. x n ) = J x n f(x) dx = 



(2.2.22) 


We can eg. calculate the first two moments of the y 2 distribution. First, the moment generating 
function is 


M x 2 (k)=f x 2 (-ik) = (l- 2 k )- v/2 


(2.2.23) 


This can always be arranged by simple subtraction. 

This is easily realized with the connection to the fourier transform, which is one-to-one with the original distribution 
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We then find easily by direct differentiation 


<W= S ( i-ar 72 


k =0 


= 1/(1 - 2 k)-( v/2+ V 


k=0 


= 1 / 


(x 2 )^ = 1/ — (1 -2A:)- (l//2+1) 


Jt=0 


= 1 /( 1 /+ 2 ) ( 1 - 2 k) 


_ 9H -h/2+2) 


fc =0 


(2.2.24) 

= 1/(1/ + 2) (2.2.25) 


Recognising a pattern immediately, we boldly write down the general formula for the n 
moment, which can be proven by simple induction. 


n —1 


l ) x 2 — v(v + 2) ■ • • (1/ + 2(« — 1)) = n(v + 2i) 

i —0 


(2.2.26) 


2.3 COMMON DISTRIBUTIONS 

Some distributions are used more than others, and the normal distribution more than any. In 
this section, 1 want to introduce a few common examples of probability distributions. A curious 
property of the normal distribution is that many other distributions asymptotically converge to 
it. We will see here exactly how this comes about. This combined with the central limit theorem 
are the reasons why almost all statistics is carried out with normal distributions. 


2.3.1 The Poisson distribution 


The Poisson distribution describes the probability of obtaining N successes, eg. a number count 
of cosmic rays or photons from some cosmic event, in a fixed time interval, if the average rate is 
fixed and the different successes are uncorrelated. That is, any success is independent from 
another. Call the rate A, then the probability is 

lN 

P ( N;A ) = A7t e_A ( 2 - 3 -i) 


This simply reflects the relative probability of obtaining N successes in the fraction, taking into 
account combinatorics, along with a normalisation e~ , such that E n P(N;A) = 1 . 

We can find the mean and standard deviation by direct summation. 


00 OO A N 

(N) = £ NP(N; A) = Ae_A E TTi = A 

N N ' 

00 00 1JV 

(N 2 ) = £ N 2 P(N; A) = Ae- A £(N + 1 ) ^ 

N N iN/ * 

00 \ N 

=Ae- A (A + l)£—=A(A + 1 ) 

N 

=> a 1 — A 


(2.3.2) 


( 2 - 3 - 3 ) 

( 2 - 3 - 4 ) 


Now let's take the limit A> 1 . This means the mean, as we just calculated, is also very large, 
and we allow ourselves to expand around it, parametrising the distribution with the continuous 
N(S) = A (1 + S), where the region of interest is |< 5 | <C 1 . Before things get interesting, we need 
an intermediate result, known as Stirling's approximation. This is basically an expansion of the T 
function defined above in Eq. (2.2.12). Since n\ —T(n + 1 ), we have 


«! = 


x n e x dx = 


e n log x-x dx 


( 2 - 3 - 5 ) 
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P(N;A) 



Figure 1: Examples of the Poisson distribution for various values of A as described in the 
legend. 


Now I expand the content of the exponential around the maximum at xq = n. This becomes 


nlogx — x « nlogn —n + —(x — n ) 2 
Inserting this into the integral, we have 

n 0 ° j 

n\ « n n e~ n / e^"") /2n dx w n n e~ n V 2 


Tin 


(2.3.6) 


( 2 - 3 - 7 ) 


JO 

where the last integral is done taking the lower limit to minus infinity, as we take n> 1 . Now 
put all this back into the distribution function. 


m\) 


\ n 

N N e~ N V 2 nN 


.—A 


1 { AS 2 

7^ exp W 


exp {AS - (A[l + + 1 / 2 ) log(l + < 5 )} 


= vhx exp 


(■N-A) 2 ] 
2 A / 


(2.3.8) 


where the last approximation expands the content of the exponential to second order in S and 
uses A > 1 > /. We finally see here the result we might have anticipated, we simply insert the 
mean and variance of the Poisson distribution in the normal distribution to get the asymptotic 
expression for the former. 


2.3.2 The binomial distribution 

This distribution comes about when looking at binary outcomes of a repeated experiment, like 
a series of coin flips. If the probability of the coin landing heads is p, then after N experiments, 
the probability of obtaining exactly n heads is 

P(n; N, p) = p n { 1 - p) N ~ n (2.3.9) 

The first factor on the right hand side is the binomial coefficient 

fN\ N\ 

n\(N — n)\' 


n 


(2.3.10) 
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which takes care of the combinatorics of the different orders of obtaining the n heads. Note that 
here we have a fixed number of repetitions, where in finding the Poisson distribution, we had a 
fixed time interval. 

P(n;N=10,p) 



Figure 2: Examples of the binomial distribution for various values of p, but fixed N — 10 . 


We find again the mean and variance 

N N 

( n } = X] nP(n;N,p) = NpJ^ 


(N 1)1 -p ,!_1 (l — p) N ~ n 


n —0 




N-l 


= Np £ P(n;N-l,p) = Np 

n —0 
N-l 


(2.3.11) 


(n 2 ) = Np ( n + 1 )P(n;N — 1 , p) — Np([N — 1 \p + 1 ) = (Np) 2 + Np( 1 — p) 

n —0 

=> o 2 — Np(l - p) (2.3.12) 

Now consider the double limit N —> oo, p —> 0 with the product Np = A fixed. Rewriting 
the probability distribution using n«N, we get 


P(n; A) = lim 


N\ 


A 


1 - 


A 

N 


N—n 


N->oo n\(N — n)\ \N 

A” N ■ ■ ■ (N — n + 1 ) ( A 

—- lim -—- 1 — — 

n\ N^oa N n \ N 

A" 


N—n 


.-A 


n! 


(2.3.13) 


which is just the Poisson distribution. That means that for a large amount of trials with 
vanishing probability per trial, the binomial distribution looks just like the Poisson distribution. 
This makes sense, since we can exactly interpret the infinite trials as being done in continuous 
time with vanishing probability, such that Np is the rate of success. Taking A 1 of course 
brings us to the gaussian limit again. 
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2.3.3 The x 2 distribution 


We have already seen what this distribution is, along with its moments. Here I quickly show how 
also this distribution asymptotically looks like a gaussian. I again use Stirling's approximation 
to write, in the limit v —t 00, and writing temporarily x — v{\ + 6 ), 




'/Anv \v/2 

1 p -vS/2+{v/2- l)log(l+<?) _ 1 

v /2tt(2v) V 2tz ( 2v ) 

(x — v ) 2 

-- exp ' 


exp 


v 5 2 

T" 


\/ 2ti ( 2v ) 


2 (2v) 


(2.3.14) 


which again is simply a normal distribution with the expected mean and variance. 


f(x;v) 



Figure 3: Examples of the x 2 distribution for various values of v. 


2.4 PARAMETER ESTIMATION 

An ideal theory will naturally explain all constants involved in it. That means we would very 
simply be able to compare predictions of this theory with an experiment. However, this is 
usually not the case. What happens most often is that a theory will contain some unexplained 
parameter(s), which must be fitted. Supposing the model is true, we can then constrain the 
parameters of the theory with a particular experiment. This notion of fitting is what the current 
section explores. 

We generally have some experiment, which produces random numbers — due to noise in 
the experiment or intrinsic variability in the source. How do we compare our model of the 
experiment to the data produced and in the process fit the parameters of the model? In general 
these are two different problems, but by the method we are going to use, they can in general be 
solved simultaneously. The majority of the current section will be about the likelihood and in 
particular maximising the likelihood, along with finding estimators of the model parameters. 
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The likelihood is defined as the pdf of the data, X, 8 given a specific model, which I 
generically denote 0 , 


m=f(x\e) 


(2.4.1) 


Note the funny semantics — it is indeed not a probability density of the model, but we still 
want to link it to some notion of model selection by probability. This has the potential to 
confuse. One easily avoids this by simply stating what the likelihood is, and never using it as a 
probability of the model [5]. Note right away that the likelihood is itself in general a random 
variable, as are the estimators we are going to derive from it. 

We now define the maximum likelihood estimators (MLE), 6 , to be the model parameters, which 
maximise the likelihood given the obtained data, ie. 


d£(§) 

d 6 


— 0 


(2.4.2) 


These estimators generally have nice properties. The most interesting properties can be found 
exactly in the context of linear models, which is what I discuss next. In the limit of infinite 
datasets, these properties extend to non-linear models. I will not discuss this in detail, only 
illustrate it with an example. For a complete description of the problem and its solution, I refer 
to textbooks on the subject, eg. [7]. 


2.4.1 Linear models 


Consider a model describing a dataset {x,,!/,}, i — 1 ... N as 

M 

y;(*i) = Yj a i A Mi) ( 2 -4-3) 

;=i 

where M < N and the functions Aj are fixed and linearly independent, ie. YLj a j A j{ x i) — 0 => 
aj — 0 . These Aj could be monomials, sines and cosines etc. Now assume we measure x with 
negligible uncertainty and y with some known uncertainty, which we take to be gaussian, ie. 
y, — yi + e,, where e,- ~ A/"( 0 , l) 9 . We can now write the likelihood, 

{ l N / M 

~2 E 

The constant of proportionality just normalises the likelihood. Now we want to maximise this 
likelihood as a function of the ajS — the unknown model parameters. Because the exponential is 
a bit unwieldy, we take the log and a factor —2 out, and instead of maximising C, we minimise 
— 21 og£. The reason for this will hopefully become clear. To find the minimum, we simply 
solve for the differential to be zero. 10 Doing this, we get a set of M equations for the M aj s. 



3 (-2log £(/?,)) „ / M 

— = 0 = -2 Y.Aj^i) Ui-t “f A j'(xi) 


da; 


( 2 - 4 - 5 ) 


i '=1 


Since we know linear algebra, and this looks an awful lot like it, we drop the indices and 
see everything as vector-/matrix products. I explicitly define the elements of the matrix A as 
Aji — Aj(x,), and the sum now looks like 

0 = A(y — A T d) =>- a — ( AA T )~ 1 Ay (2.4.6) 

8 Hatted variables will generally be either observed data or estimators — both of which are random variables. Unhatted 
will usually be the corresponding true variable. 

9 It is always possible to absorb the variance of e into the As and thus have unit variance 

10 And show that it is indeed a minimum, not a maximum or saddlepoint. 
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The matrix A was defined to have linearly independent rows, which in turn means the inverse 
of AA t exists. The proof of this is as follows. Define S — AA T . Any positive-definite matrix is 
invertible, so I want to show S is positive definite. We have straight forwardly that for any X, 

X T SX = X t AA t X = \X t A \ 2 > 0 (2.4.7) 

which shows it is positive semi-definite. Now we need to show that if the product is exactly 
0 , then so is X. Remember the functions Aj were assumed to be linearly independent, which 
means 

a T A = 0 => a = 0 (2.4.8) 

This is exactly what we need, since if we write 

0 = X T SX = X t AA t X = \X t A \ 2 =>• X t A = 0 =>- X = 0 (2.4.9) 

This means that S is indeed positive definite and the inverse (AA r ) -1 exists. 

Now we are interested in two things: the distribution of —2 log C(dj) and of the estimators 
aj, under repeated (thought-)experiments 11 . We first look at the likelihood. 12 

- 21 og£(fl) = | y-A T a \ 2 = | y-A T (AA T )~ 1 Ay \ 2 

— (In - A T (AA T )~ 1 A s j y (2.4.10) 

Here P = A T (AA T )~ 1 A is a projection in the sense P 2 X = PX for any X G 1 R N to an M 
dimensional subspace. By an orthogonal transformation, we can rotate the y to y — Oy 
such that the projection Py has its elements only in the first M entries, ie. P(j/i,.. •, J/n) T — 
(yi,..., yu, 0 ,..., 0 ) T . Note that since the transformation is orthogonal, we also have y, ~ 
JV ( 0 , 1 ). Taking now y = (Ijy — P)y = ( 0 ,... , 0 , 1 /m+I/ • • • the likelihood takes the 

following form 

N 

—2 log C(a) — y T y — fi ~ xI=n-m (M-“) 

i=M +1 

This result is the origin of two notions, which are often abused in practice. The first is, we 
simply call —2 log £ the chi squared, y 2 . This may result in a bit of confusion since now one has 
a random variable called y 2 , which is y 2 -distributed, ie. its pdf is the y 2 distribution. The other 
is the idea of a reduced number of degrees of freedom, v = N — M, ie. the number of data points 
minus the number of fit parameters. These ideas are widely used even when the model is not 
linear. 

Now we turn to the distribution of the estimators d. We have already seen the result, which 
is 


d — (AA T ) 1 Ay^d~Jf(a,(AA T ) 1 ), (2.4.12) 

where the normal distribution is to be understood in the multivariate sense. We see here a 
specific example of a more general result. The MLE is normally distributed around the true value 
— it is unbiased — with covariance matrix described by 13 



(2.4.1.3) 
(2.4.14) 


11 Of course there is only the one actual experiment, but we might imagine performing it again and again. It is under 
these repetitions that the estimators are random variables, whose pdfs we want to find. 

12 Note that I have already thrown away a constant normalisation term. This only shifts the distribution, or rather, the 
distribution we find is that of — 21 og(£\/ 27 r ). 

13 For two matrices A, B, we write A > B if A — B is positive semi-definite. A proof of this inequality comes later. 
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where the average is taken over repeated experiments. X = AA T is called the Fisher Information. 
In this case, the double derivative is a constant, so the average is trivial. This bound on the 
covariance matrix is called the Cramer-Rao bound, and is the minimal covariance for unbiased 
estimators. An unbiased estimator with this minimal variance is called efficient. We see that the 
MLE for linear models are all exactly unbiased, normally distributed, efficient estimators for all 
N. 

The linear models are nice because, as we have just seen, practically everything can be done 
analytically. This gives us a nice starting point for the next discussion. For a general, non-linear 
model, the results in the example are no longer valid. Let us explore finite sample sizes with a 
very simple example. 


2.4.2 A non-linear model 


Consider the data set {£,}, i — 1 ... N, drawn from a normal distribution with unknown mean 
and variance, but with no measurement uncertainty, x t ~ A f(] 4 ,cr). The likelihood for this 
experiment is 


C — ( 2 ncr z ) N/2 exp 



(2.4.15) 


and we are trying to determine ]i and <r 2 . Note how we cannot neglect the normalisation this 
time, since we are now fitting cr. The maximum point ( fi, a 1 ) is 

(2.4.16) 

i 

a 2 = £(i; - fi ) 2 (2.4-17) 


Now consider the distribution of these estimators. The fact that we don't know a complicates 
things, since this is what set the scale for us before — we could measure deviations in terms of 
a fixed number. Now this scale is a random variable. For instance, we immediately see that 
fi J\f (ji, <7/ \/N), but here we've used the unknown a to define the variance. 

We turn therefore first to the distribution of the variance cf 2 . I first write out the ft and 
rewrite the sum, giving 

d 2 — N ~ 2 ~ x j ) 2 (2.4.18) 

ij 


We now need a small trick to evaluate this sum. What we really want — anticipating the answer 
— is something like a sum of squares Jf XjXj, not of squares of differences, as we have. So we 
recast it to 


0 2 = N 

ij 

and find the matrix C we need here is 

/I — N -1 —AT 1 
c = -N- 1 1 - AT 1 



(2.4.19) 


(2.4.20) 


We now pseudo 14 Cholesky factorise C, ie. we find an upper triangular matrix If, which satisfies 
U T U = C. This matrix is 


f s/(N -i)/(N + l- i) i = j 
Uij - < -l/V(N-i)(N + l- i) i < j 
{ 0 i > j 


(2.4.21) 


14 Pseudo since strictly C is only positive semi-definite. 



14 


STATISTICS 


We now use U to find the rank of C, which determines the pdf of the sum. Taking the reverse 
product, we see that 


UU T 


n o 
o 1 


V 


\ 


1 o 
0 0 / 


(2.4.22) 


which immediately tells us the rank of C is N — 1 . This means U is almost an orthogonal 
transformation — we just lose one degree of freedom. Thus we will define new variables 
yj = UjjX,, j = 1 ... IV — 1 , which are also drawn from independent normal distributions. 
The variance is now given as 


kjXj 


Ncf — E XiCijXj -h ft/:/tf/c/ 

ij ijk 

N—l 

= = E y \ ~ ^=n-i 


(2.4.23) 


This shows that for finite N, the estimator is a bit off, as 



= v 


N — l 
N 


(2.4.24) 


This comes about because we fit the mean while calculating it. The missing degree of freedom is 
of course the mean fi which we now consider. Had we known cr, we would immediately write 
VN(fi - y)/a Af{ 0 , 1 ). Exchanging d for a, the distribution changes a bit. We may write 

\TN{fi-}i)/cr = ” (2.4-25) 

where n is normally distributed n ~ Af( 0 , 1 ) and c follows a y distribution, c ~ Xv=N-l - 15 
Note how this combination exactly cancels the dependence of (J. This particular combination 
of random variables follows a distribution known as Student's t-distribution with v — N — 1 
degrees of freedom. Its pdf is 






v+1 

2 


(2.4.26) 


We are now in a position to understand the At —> 00 limit of the MLE. We see that for finite 
N, neither of the two estimators follow a normal distribution, and a 2 is even biased. In the 
asymptotic limit though, both distributions are normal, and we have 

VN(fi - }i)/d ~ Af(0,l) => fi ~ \TN) (2.4.27) 

Ncr 2 / cr 2 ~ A f(N, VlN) d 2 ~ After 2 , y E cr 2 ) (2.4.28) 

It is only in the asymptotic limit the estimators follow an unbiased normal distribution, with 
variance given by Eq. (2.4.13). As I showed earlier, many distributions tend to a normal 
distribution for large N. This is what is happening here too. In this limit, the likelihood tends 
to a normal distribution, for which the results from the previous section hold. 


15 The x distribution is simply the distribution of the square root of a x 2 random variable. 
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2.4.3 Cramer-Rao lower bound 

Now let us see how the Cramer-Rao bound appears. I will follow the proof from [7]. Assume 
we have a set of unbiased estimators {gi}, i — 1 ... r, of the quantities {y ,}, ie. (gj) — g t . The 
likelihood function generally depends on some parameters, say 9j, j — 1 ... k. We now construct 


another set of variables, }, and build the r + A:-vector {yj... g r , The 


covariance matrix of this vector is 



(2.4.29) 


Where E^ is the covariance of the estimators g, X is the Fisher Information and 



By construction, this covariance matrix is positive definite. Furthermore, we have that 



(2.4.3!) 


since the Fisher Information is positive definite. This is seen easily since we can rewrite it as 



d log C d log C 
d 6 j d 6 : 



(2.4.32) 


By multiplying the two matrices, we see that 


1 

-AX - 1 


E* 

A 


E f - AXA t 

0 


_ 1 . T 

0 

X” 1 

X 

A^ 

X 

— 

S 1 T 

X~ x A T 

1 

— 

E g — AX 1 A 1 


which holds for any subset of the estimators g. From this it follows that all eigenvalues of 


E g — AX ] A t are positive or zero, or equivalently that the matrix is positive semi-definite. 
Looking at unbiased estimators of the 9s, we see that A reduces to an identity matrix and the 
bound dictates the matrix Eg — 2 T 1 is positive semi-definite. This is exactly what is meant in 
Eq. (2.4.13). 


Note however, that in deriving this bound, we rely on the estimator being unbiased. It is 
easy to think of estimators with lower variance, say g — 1 . This has obviously zero variance, but 
is not a particularly good estimator of anything. It is also worth noting that this bound does 


not require that the estimator follows a normal distribution. It sets a bound on the variance of 


any unbiased estimator. However, it is only a lower bound, and by no means a guarantee — only 
in special cases, like the MLE of a linear model, does an estimator saturate the bound exactly. 

2.4.4 Confidence regions 

Having found the distributions of the estimators of the parameters of a theory, I now want 
to define the notion of confidence regions. Loosely speaking, these are regions in which we are 
confident the true value of the parameter lies. This confidence is usually defined in terms of a 
coverage probability, p c . That is, if we define our confidence regions in the same way in repeat 
experiments, then for every repetition we have the probability p c that 9 is inside our confidence 
region. The usual objection here is that once the experiment is done, we can no longer speak of 
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a probability that the true 0 is inside or outside the confidence region — it either is or is not! 
The probability as such is defined prior to the experiment. This distinction shall not worry us 
too much. 

To begin the discussion on confidence regions, we have to understand the concept of a 
p-value, which is closely related to the coverage probability. This is very simple. The p-value 
of some event is the probability of seeing something more extreme or as extreme as what is 
observed. In different scenarios this may be computed in a variety of ways, depending on the 
difficulty of the problem at hand. In some cases, p-values can be computed analytically, while 
for others one resorts to Monte Carlo (MC), ie. random simulations. As such, the p-value is 
entirely dependent on the model being tested, and is only telling us how unlikely something is, 
given a specific model. Let us see how this works in an example. 


A fair coin? Consider tossing the same coin N times. We now ask ourselves the question 
"is the coin fair?", and we can address the answer with a p-value. Say the coin lands heads 
up M times, where without loss of generality, M > N/ 2 . To calcidate the p-value, we now 
simply add up the probabilities of getting M or more heads when tossing a fair coin, 


V 



0 . 5 m 0 . 5 N ~ m 



— 0 . 5 n 



2 F 1 ( 1 ,M-N ,1 + M;- 1 ), 


(2.4.34) 


where 2 fi is the hypergeometric function, whose form is not particularly enlightening. To 
make things more clear, let's take a specific example. In Fig. 4 I take various values for N 
and plot the p-value one zvoidd obtain as a function of M. The line across denotes the custom 
95 % confidence level, ie. everything under the line is excluded at more than 95 % confidence. 
It is evident that the as N goes up, we need a smaller and smaller relative deviation from 
M — N /2 before we can exclude that the coin is fair. 


p-value 



M/N 


Figure 4: p-value, given by Eq. (2.4.34), °f different outcomes M from tossing a coin N times 
for different values of N as labeled in the legend. This tests the hypothesis that the coin is fair. 


Originally we wanted to constrain our parameters. With the p-value at hand, we just need 
Wilks' theorem, which tells us the distribution of a likelihood ratio in terms of a xf distribution. 
This was first shown in [8]. First I go through the proof of the theorem, and following that, we 
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will see how this constrains our parameters through confidence regions. I will here just look at 
a linear model, and I simply argue that the results we find extend to non-linear models in the 
asymptotic limit — and that we abuse this fact and use Wilks' theorem always. 

Consider the type of model from Sec. 2.4.1. Take a space for the possible coefficients. Q, 
and a subset i £ O of dimensions N and M respectively, so 0 < M < N. Now call the true 
parameters a a = {a±,a w }, where a w G a>,a ± G T = co . We can see _L as the remaining part 
of Q, when we fix a w . Now we have both the MLE a a = {a \ , a w } G fi and a restricted MLE 
£ -L, which satisfy 


81og£(fl n ) = Q 

da, 

dlog C{a^,a w ) = 
dcij 


i = l...N 
i — 


( 24 - 35 ) 

(2.4.36) 


The quantity C p {a u ,) = £(a±,a w ) is called the profile likelihood, a j_ is given by 


( 2 - 4 - 37 ) 


where I have partitioned the Fisher Information as 


Now I define the likelihood ratio 




, _ £(^J_/^a>) 

£(«n) 


(2.4.38) 


(2.4.39) 


and seek the distribution of this under the hypothesis that a w are indeed the true parameters. 
Take —2 log of this and insert factors of the true likelihood £(aq). 


21ogA 2log +2l °% £(an) 

Each of the terms on the right hand side can be reduced to the forms 

£(«i _ /* „ \T- 


—2 log : 


2 log 


£( fl n) 

£(«n) 


£(«n) 


= “(fll -A_l) Z_L(fl± ~«±) 
= («n ~ «n) T ^n(«n - «n) 


(2.4.40) 

(2.4.41) 

(2.4.42) 


This is seen by simply inserting the MLE, Eq. (2.4.6) into Eq. (2.4.4) an d collecting terms. Now 
write the derivative of the log-likelihood at the true parameters an, split into the _L and co parts 
as 


(V\ = ^log^(fln) 

\tJi 

This gives two expressions for // and one for £, 

(j^j — ^n(«n — a o) 

V = I±{a±~a±) 


(2.4.43) 


(2.4.44) 

(2.4.45) 


Remember, since the estimators follow the distribution in Eq. (2.4.12), these variables follow a 
normal distribution (77, £) ; - ~ A/"( 0 ,Xq). Inserting this into Eq. (2.4.40), we have 


T 

—2 log A = (**) 


(2.4.46) 
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Using the following block inversion identity 

T -! _ (Xf 1 + T-^1{X W - X T Xl l X)- x X T Xl l -XI 1 X(X W - 
n 'v -(Xu-X 7 !- 1 !)- 1 ! 7 !- 1 {X^-X 7 !- 1 !)- 1 ) 

we can write the first product in Eq. (2.4.46) as 

+ (X T X~hj - Z) T (X U , - X T J I 1 J)- 1 (J T J -'t, - £) 


(2.4.47) 


(2.4.48) 


The first term here is subtracted in the likelihood ratio, and we have 

—2 log A - (X T Xl\ - 0 T (X u ,-X T X:- 1 X)-\X T X- 1 r 1 - £) (2.4.49) 

This combination of variables, X 7 Xf 1 ;/ — £ again follows a normal distribution, for which the 
covariance is easily seen to be 

((X T XJ^ - ^(X 7 !-^ - £) t ) - (tf 7 - 2 X t X"V t + X 7 Ifh 1 V 7 If 1 l) 

={X w -X 7 Xf 1 X) (2.4.50) 

Meaning the likelihood ratio is simply the sum the squares of N — M — the number of fixed 
dimensions — independent gaussian random variables 

—2 log A ~ Xv=N-M (2-4-51) 

To test the hypothesis that a w are the true parameters, we now simply find the p-value of 
getting the particular —2 log A value for that a u ,. This p-value is given by 


p-value 


/— 2 log A 


xI=n-m ( x ) dx 


(2.4.52) 


To illustrate this, let's look at an example. 


Constraining a one-parameter linear model Consider drawing from a gaussian 
distribution with known variance, say a — 1 , but unknown mean y. The likelihood is of the 
form Eq. (2.4.4), specifically 


C{y) <x exp 




( 2 - 4 - 53 ) 


and we want to say something about y given some experimental result. For a particular 
outcome of the experiment, say N datapoints, we use Wilks' theorem in the following way. 
We take as Q the full range of the y,for zvhich we find the MLE as 

y ( 2 - 4 - 54 ) 


and for every possible value of y, we take w as just that y. Since there are no parameters left, 
the restricted MLE in _L is trivial. The p-value is calcidated according to Eq. (2.4.52), 

p-value(y) — / Xv=i( x ) dx where (2.4.55) 

J -2 logA(fO 

- 21 ogA(f<) = -2 log [C{y)/C{y]] = N{y - yf (2.4.56) 


I now choose to look at the values y n = y(l±n/ yN) for various n. This gives us the 
integral, for n = { 1 , 2 , 3 }, 

f-OO 

p-value(y n ) — / xl=i( x ) dx — { 0 . 32 , 0 . 046 , 0 . 0027 } 

Jn 2 


( 2 - 4 - 57 ) 
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Or in ivords, we can exclude these values with confidence { 0 . 68 , 0 . 954 , 0 . 9973 }. Say we want 
to he at least 68% confident, then our confidence region is fi ± = [fi( 1 — —U), p(l + 

)—)], ie. no values inside this interval can be excluded with confidence greater than 

68 %. 

Because of the gaussian nature of the likelihood ratio, this limit is usually called the 1 -a 
confidence interval, as it is exactly one standard deviation away from the mean, and the 
standard deviation is usually denoted a. We can in the same fashion construct the n-a 
interval for the other ns. 

The previous example simply shows the general use of Wilks' theorem. Another subtle thing 
we can do is to eliminate parameters, which are not of immediate interest. Such parameters 
are usually called nuisance parameters. To see how this works, we just have to have one more 
parameter. The following example is trivially extended to N parameters of which M < N are 
nuisance parameters. Unfortunately the 2 dimensional nature of paper only allows for easy 
visualisation of 2 dimensions. 

Eliminating nuisance parameters Consider a two-parameter linear model with the 
general likelihood, in vector notation, 

£(fl) ocexp —A T fl) 2 j (24.58) 

with a = {aj_,a u ,} and y G R N . As stated before, the MLE is given by Eq. (2.4.6), 
a = (AA T )~ 1 Ay. First, let's do the same thing we did before, and let (1 be the entire space 
of a, while co fixes both parameters, ie. Q = co. That makes the likelihood ratio 

—21ogA (a) = (a — d) T l(a — d) (2.4.59) 

a random xl=2 variable for which we again calculate p-values according to Eq. (2.4.52). 

Noiv one of the parameters a j_ is a nuisance parameter. This means that we only fix a w , and 
find the constrained maximum over a j_. So we look at the quantity 

—2log A(a w ) = -2log (2.4.60) 

Noiv, by Wilks' theorem, this quantity is a random Xv=i variable. The last two points 
are illustrated in Fig. 5. For the sake of illustration, the parameters are taken to be very 
correlated. 

We see that the question which Wilks' theorem helps us answer is if we can confidently exclude 
some parameters a^ for all values of the remaining parameters a j_. Even if there is just a single set 
of parameters {a \ ,a u , \ such that the p-value is big enough, ie. —2 log A is small enough, then 
a w cannot be excluded. From Fig. 5 we see exactly how for a w = \/ 2 , we only have — log A < 1 
when a± — 1 . This still means a w — \/2 cannot be excluded at la. Said differently, for every a w 
we test the hypothesis that this is the true value, regardless of what the a parameter is. 

2.4.5 Marginalisation 

In the previous derivation, I strictly refer to maximisation of likelihoods. Even so, one will 
often encounter the term marginalised likelihood. The use of this should be kept to a minimum 
outside Bayesian reasoning, which is described briefly in Sec. 2.6. Marginalising the likelihood 
in simply integration instead of maximisation. That is, instead of using C p , we define the 
marginal likelihood 

C m (6) = j d(p 


(2.4.61) 
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Figure 5: Illustration of confidence regions for two parameters with X L =X to = \,l = \/y/l. 
The dashed contour shows the 68% confidence region of both parameters, while the dotted lines 
are the boundaries of the 68% a w confidence interval, taking a to be a nuisance parameter. 
As shown, these dotted lines mark the extreme a w for which any a gives —2 log A < 1. The 
number 2.3 is the solution y to the equation j™ xl=i( x ) dx — 1 — 0.68. For higher dimensions, 
one could also give the boundaries of the joint contour in lower dimensions — here the boundary 
would be at ± \/2.3 instead of 1. It is important though, to remember the difference in meaning. 
The bigger one also contains information on the other parameter, while the small one take all 
but a w as nuisance parameters. 


A trivial exercise is to show that the confidence regions determined from this quantity is 
in general not the same as one would get with the profile likelihood. The objection is now 
that, obviously, the marginal likelihood is not reparametrisation invariant, ie. for some other 
parametrisation of the nuisance parameters <f> = /($>), 

J C(6,<p) dipyt j £(0,<J>) d<t> (2.4.62) 

The two integrands differ by a jacobian / — dip/ dO. This means that when you pick your 
parametrisation for the likelihood, you assume in some sense that this is a good parametrisation. 
This again reflects the issue that the likelihood is not a pdf of the model — that is why the 
meaning of this integral is not immediate. 

Now it is an equally easy exercise to convince oneself that the maximisation procedure 
is completely free of this caveat. The maximum likelihood for some 0 cannot depend on the 
chosen parametrisation of (p , so obviously ma x ( p£(G, <p) = max t |, £,((), <b). 

2.5 MONTE CARLO METHODS 

The previous sections have mostly described linear models, and in one case a very simple 
non-linear model, whose answer can be found analytically. This, unfortunately, is not always 
the case. For some random variables, it can be impossible to find explicit expressions for their 
distributions. When this happens, as is often the case, one way around it is to simply simulate 
the distribution. This approach is broadly called Monte Carlo (MC) methods, and underlies 
many results of modern physics. The approach can also be applied to numerical evaluation 
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Figure 6 : Example of MC integration. Each point is drawn at random. In this case, 
N — 10 3 , M = 781. This means the la confidence interval for the integral is approximately 
(7 t/4)mc — 0.781 ± 0.013, compared to the true value, 7t/4 = 0.7853, we see that this is indeed 
a reasonable estimate. 


of integrals. To see this, let's go through the classic example, where we find n ss 3.14 by MC 
integration. 

Estimating n We know the ratio of areas of a unit circle to a square with side length 2 
to be 7t/4. Now as an exercise we want to find the value of this numerically. We look at a 
single quadrant, x e [0,1], ye [0,1], where the ratio of areas is the same. We now draw N 
points inside this region and for every point check if it is inside or outside the circle. So for 
every point, check if x 1 + y 2 < 1. Finally, we count the number inside the circle, call it 
M, and divide by N. The ratio M/N estimates 7t/4 (since the region from which we draw 
has unit area). 

Nozo, since we are doing this as MC, the estimate zoe get has an associated error, zvhich zve 
must also estimate. Namely, for every point zve drazv, it has the probability tz/A to be inside 
the circle. That means M will be binomial distributed with p — ti /4 with N drazvs. From 
our previous caladations ( 2 . 3 .n)-( 2 . 3 .i 2 ), we get immediately 

(M} = N ■ 7 t /4 ( 2 . 5 . 1 ) 

a M = N ■ 7 t/ 4(1 - 7 t/ 4 ) ( 2 . 5 . 2 ) 

or, if we look at the quantity M/N, and approximate the binomial zvith N very large as a 
gaussian, 


M/N ~ J\[{tz/A, yj 7t/4(1 — tz/A)/N). ( 2 . 5 . 3 ) 

We see here a very general (approximate) residt: the error on the estimate falls off as VN 
So, not surprisingly, the larger we take N, the better the approximation zve get. This is 
illustrated in Figs. 6 and 7 . This technique is in its most naive form extended trivially to 
any integral in any number of dimensions. Of course, as the parameter space becomes larger, 
computing time increases, but the basic picture remains. 
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|M/N - rr/4| 



Figure 7 : Errors from the computation of n by MC integration. We see that all the errors are of 
expected magnitude (notice that it is plotted on log-log axes). For every N, I perform 10 MC 
simulations, simply to show the intrinsic variability in the estimate. 


So we can do integrals numerically. This is comforting! As mentioned earlier, we also 
might want to find distributions for which we cannot find an analytic expression. This is 
heavily used when finding p-values for some non-trivial quantity. What one does is to simulate 
an experiment a number of times, say N, and for every simulation find the desired quantity. 
The distribution of these simulated quantities then answers the same question as would the 
analytic expression, given this model, how (un)likely is the observed outcome, simply by numerical 
comparison between the MC results and the real experiment. I will now extend the previous 
non-linear model of Sec. 2 . 4.2 very slightly, and we shall see that we immediately lose the 
analytic expression for the estimators. We will then use MC to regain control. 

Unequal errors on measurements Take again the estimation of a normal distribution 
with (p,cr 2 ) — (0,1), but this time add distinct measurement errors, U[, on all XiS. This 
means the likelihood is 


N 


C — Y\(2.n[cr 2 + of]) 1/2 exp 


1 (Zj - v ) 1 

2 a 2 + of 


Looking for the MLE ( fi,a 2 ) of this model, we get 


F = 


E Xi/(a 2 + af) 

El/(l7 2 + cf) 


£l/(b 2 + n 2 ) = £ 


(*i ^ v ) 2 

( Zr2 _l_ rr2 \ 2 


(2-5-4) 

(2-5-5) 

( 2 . 5 . 6 ) 


The appearance of cr, in these sums prohibits the nice manipulations we could do before, 
and at this point we're stuck on the analytic side. What we do is to simply solve these 
two equations numerically, for a number of simulated experiments and find an empirical 
distribution. It is immediate that the distribution of Cj has a lot to say about the distribution 
of the MLE. 

Nozv let's do the concrete MCfor two different experiments. The only difference between the 
two is the distribution of the individual, known errors cy We will take N — 100 datapoints 
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in every experiment, and 10 4 simulations. The first experiment is just like the old one, we 
take all cr, — 1 equal. The other has uniformly distributed errors Vj ~ lf(0.1,1.9), and 
(af) — 1.02, ie. almost 1 like the other. The exact distribution is not of huge importance. 
Noiv let's see what difference this makes. Simulating the experiment 10 4 times, zve get the 
distributions shown in Fig. 8 . We see that while both are hitting the right answer on average, 
the tails are different in the distribution of a 2 . 




Figure 8 : Distribution of fi and dr 2 from 10 4 MC simulations. Orange shows the original experi¬ 
ment with only the same errors, while blue shows the distribution with errors n, distributed 
uniformly between 0.1 to 1.9. We see clearly that while the distribution of the mean is more 
or less unchanged, the distribution of cr 2 is altered, and no longer follows the x 2 distribution 
derived earlier. The two histograms have the expected distribution for the original experiment 
superimposed. 


Another interesting distribution to see from this experiment is the distribution of the 
X 2 — Yf{xi — fi) 2 / (a 2 + of). This is shozvn in Fig. 9 . It is immediate that the x 2 
distribution does not describe this distribution very zvell. We can interpret this as exchanging 
variability is the x 2 for variability in the a 2 . Had zve set all of <C o 2 , then we end up with 
the situation from Sec. 2 . 4 . 2 , and the x 2 is alzvays perfect, and all variability is in the o 2 . If 
we instead have of o 2 , then all the errors are practically fixed and we end up with an 
almost linear model, ie. the o 2 does nothing to the fit, and we just fit p. This gives us a fixed 
o « 0 and a x 2 which is distributed, well, as a x 2 - The situation here is a kind of middle 
ground, zvhere both are of the same order, and so the x 2 holds some of the variation, while 
also the o 2 varies. 

Most importantly, this shows that when the errors on the datapoints are not equal, the MLE 
is not always a perfect fit, ie. x 2 7 ^ N. Even when fitting the error, some variation remains. 

PDF 

0.07 



Figure 9 : Distribution of x 2 — — }i) 2 /(cr 2 + of) from MC simulations with distinct errors. 

Superimposed is a x 2 distribution with 100 degrees of freedom. 


These two examples show the very basics of MC simulations, and the types of problems they 
solve. This section is by no means exhaustive. It is mostly meant as a very soft introduction to 
the subject of stuff zve can't calculate exactly, which unfortunately is a very big one. 
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STATISTICS 


2.6 BAYESIAN STATISTICS 

All statistical analysis in our work \s frequent is t. An objection to what I have shown so far, as I 
already mentioned, is that the p-values we get out are not probabilities in the sense we would 
like them to be — they do not represent model probabilities. If one is unsatisfied by this, then 
we may use Bayes' theorem to go from the likelihood, which is the pdf of data given a model, 
to a posterior pdf say f(0 |X), which is the probability of a certain model given the obtained data. 
By Eq. ( 2 . 1 . 2 ), this is done as 

/(»!*) = ™ (,M 

where, given f(9), f(X) = f C(9)f(9) d9. f(9) is called the prior, and /(X) is called the evidence. 
Note however, that using Bayes' theorem requires a prior, for which we in most cases of interest 
in fundamental physics have no idea what should be. In particular, the pdf changes under 
change of variables, so if we were to pick something boring, in the sense of being uninformative, 
then the very same function in another variable might be very restrictive — recall the discussion 
in Sec. 2 . 4 . 5 . 

With Bayesian statistics, we get exactly what we like — a direct measure of the pdf of a 
model given the data we see. No hypothesis testing and no ambiguous p-values. The price one 
has to pay is the choice of a prior, which in some cases is less trivial than other. In a sense, the 
Bayesian method is trying to answer the unanswerable — doing fundamental physics, there 
is no way we can pick the true prior, since all our knowledge on any subject is derived from 
experience, which again would have to have been interpreted with some prior. 



COSMOLOGY 



Today's cosmological studies are by and large interpreted within the bounds of the so-called 
Concordance or Standard cosmological model. In this section, I will give a summary of the 
theory with some examples of links to observables and experiments constraining it. I cannot 
hope to give a textbook introduction to cosmology, but instead refer to one of the many excellent 
books written on the subject, [ 9 - 12 ]. 


3.1 GENERAL RELATIVITY 


The foundation of modern cosmology is Einstein's general theory of relativity. Here I aim to 
introduce main motivations and concepts necessary for the framework of cosmology 1 . This 
describes not only how matter moves in space and time, but also how matter influences, or 
perhaps more famously bends, spacetime. 

The geometry of spacetime is described by the metric, which tells the distance between 
neighbouring points. We define the proper time interval as 

dr 2 = -g }W dx p dx v , ( 3 . 1 . 1 ) 


which defines for us the metric. The equations of motion for a test-particle in spacetime is, in 
a freely falling, locally inertial coordinate system, a straight line, or more specifically a curve 
of extremal proper time. In this coordinate system, call it £, this means we differentiate the 
coordinates of the particle two times with respect to the proper time and require it be zero. 


dz 2 


= 0 . 


( 3 . 1 . 2 ) 


By reparametrisation invariance — loosely the statement that Nature doesn't care what co¬ 
ordinates we use — we can translate the coordinates £ to any coordinate system x we find 
convenient, leaving all physics invariant. In particular, the line-element Eq. ( 3 . 1 . 1 ) doesn't 
change. 


-Vfivd^d^ = -g llv dx p dx v 


(3-1-3) 


In the £ coordinates, the metric takes the very special form rj — diag( — 1,1,1,1 ). 2 In the x 
coordinates Eq. ( 3 . 1 . 2 ) takes the form 


d_ / dl?dx v \ 
dz \3x v dr J 


d?‘ d 2 x v d 2 ? 1 dx v dxP 
dx v dz 2 dx v dxP dz dz 


d 2 xP p dx p dx a 
dz 2 pa dz dz 


— 0 


(3-i-4) 


where the second line follows from multiplying with and renaming indices. I also introduce 
the affine connection 


tl _ a 2 r d^ 

pa - dxPdx* d£ v 

(3-1-5) 

d 2 ^ _d^ F 
dxPdx a dxP pcr 

( 3 . 1 . 6 ) 


1 The following derivation follows [ 12 ], including his conventions. 

2 Note that I omit any factors of the speed of light c. This factor can be restored by dimensional analysis. 
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Eq. (3.1.4) is known as the geodesic equation. 

There is a subtlety here, which I brushed over. For massless particles — radiation — we cannot 
use the proper time as independent variable to label the path, since this vanishes identically. 
Instead use the zero-component of the coordinate vector, £°. The following derivation is like 
before and we end up with 

_ d 2 X>‘ }i dxP dx a 

° “ 3(f°) 2 +T P (r d^ d£° (3 ' 1,7) 


We will need these equations to describe the propagation and properties of particles in the 
universe. Before doing that, we must know how spacetime reacts to matter. First, let's rewrite 
the connection. Rewrite Eq. (3.1.3) 


_ ag/ 3 

S}lV ~ dxPdx"^ 


(3.I.8) 


and differentiate with respect to the x coordinates 


d S}iv 

3x A 


r a 2 ^ d£p d£ a a 2 <^ 

\ dx> l dx A dx v dxP 3x v 3x A 

r d? a^ 

\ P A dx a dx v dxP vX dx°- 

^A&cv + r L v \gafi, 


Vap 


dap 


( 3 - 1 - 9 ) 


where line 2 and 3 follow from Eq. (3.1.6) and Eq. (3.1.8) respectively. Next, add three of these 
with mixed indices. 


d Sfia , dgva _ dg}iv _ wr , r a „ 
dx v + dxP dx“ l Vrgcra + i-avg^ 

+ ^v}igcra + r ajigtrv 
^ fwgw ^vagfiP 

= 2 T^ v gcr a , ( 3 - 1 - 10 ) 

where I use that the connection is symmetric in the two lower indices, as is clear from the 
definition Eq. (3.1.5). Defining the inverse of the metric, g )lv , 


g FV gvA=% 


(3.1.11) 


we multiply Eq. (3.1.10) by g Aa and get 

pA ^ Aa f d gh* . dg VOL _ d g}iv \ 

> lv 2 * \ dx v dxP dx a J 


(3.1.12) 


This expression is entirely free from the coordinates £, and can be readily calculated given the 
metric g ,lv in any coordinate system. 

Now we want to write tensors describing the spacetime. Using just the metric and its 
first and second derivatives, one can show that the unique tensor which is linear in second 
derivatives of the metric, is the Riemann(-Christoffel curvature-)tensor, 


R a = 


ar 


A gpA 

HP , } lv , tA r </ 

ax* + dxP +1 pApv 


yA yV 

1 Vt] L pp 


(3-1-13) 


Of course we can also take contractions of this tensor, of which the two we will need are the 
Ricci tensor, 


R t n> 


= R 


A 

pAv 


(3.1.14) 
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and the curvature scalar 


r = r ;; (3.1.15) 

In general, a non-vanishing Riemann tensor signifies the presence of a gravitational field. If the 
Riemann tensor is strictly zero, then some transformation takes one back to Minkowski space, 
which has the metric rj^ v . Any non-zero component of the Riemann tensor prohibits such a 
transformation. With these tensors, Einstein's field equations (EFE) take the form 3 

1 

Rfiv ~ ~ A g FV = -SnGT }lv , (3.1.16) 

= -8ttG (j FV - - Ag llv ( 3 -i-i 7 ) 

where T f{v is the energy stress tensor, G = 6.67 • 10 11 Nm 2 /kg 2 is Newton's constant and A 
is the infamous Cosmological Constant. I return to this in Sec. 3.4. The second equation above 
follows from tracing the first. 

Newtonian mechanics As everyone learned in school, Newton predicted the trajectories 
of planets, combining his F c< r~ 2 law of gravity with F — ma. Let's see how this is the 
limiting case of the geodesic equation and a specific geometry — as of course it should be. 

The limit we will take is a stationary weak field, and a slozvly moving test particle. This 
translates to the following expressions 


g}iv — >//(v + hj lv 

IVI < 1 

g v _ 0 


dt 


dx l 

— 

> 


dr 

3r 


Using Eq. (3.1.21), we write the geodesic equation (3.1.4) as 

d 2 x> 1 _ pi (df\ 2 

dr 2 00 \3t/ 


( 3 - 1 -i 8 ) 

( 3 - 1 - 19 ) 

(3.1.20) 

(3.1.21) 


(3.1.22) 


Calculating the connection, we use that all time derivatives of the metric vanish, and 
derivatives only act on the small, h-part. To first order in h we have 


T V _ 1 in/dgo 0 _ 1 „UV dh 00 

00 2 g dx v 2 7 dx v 


( 3 -i- 2 3 ) 


Putting this into Eq. (3.1.22) we get, 


d 2 t 

dr 2 

d 2 x 

dr 2 


= 0 


1 

2 


(I) Vllm ^ 


d 2 X 

dt 2 


1 

2 


V/lQO 


(3.1.24) 

(3-1-25) 


This looks an awful lot like the Newtonian result, 


ma = —mVcp 


(3.1.26) 


where (p is some Newtonian potential. For eg. a spherical mass distribution of mass M, 
this takes the familiar form <p — —GM/r. We see that setting Iiq q = — 2 cp gives us the 


3 Note that sign different conventions for g )lv and may lead to different signs here! 
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Newtonian solution. To check that our approximation holds for typical potentials, put in 
values for the Sun- and Earth-radius and mass, 


\<psun\ = = 2 . 12 - 10 - 


R 


Sun 


lfe rt ; ; | = ^^-6.95-10- 10 


R 


Earth 


(3-1-27) 

(3.1.28) 


Evidently the approximation is very good even at astrophysical scales! 


3.2 THE COSMOLOGICAL PRINCIPLE 


The EFE are in general very hard to solve. Given T^ v , they describe 10 coupled partial differential 
equations for the metric gy V . As such, any exact solution typically has a lot of simplifying 
symmetry. The cosmological principle is one such set of symmetries. In short, it states that our 
or anyone else's place and orientation in the universe shouldn't be special 4 . Any translation 
or rotation must therefore leave the metric invariant. Obviously, the universe isn't exactly 
homogeneous or isotropic. These properties are meant to be approximately true only on 
cosmological scales 5 , meaning when we average matter and geometry over large enough scales, 
this description is suitable. 

This high degree of symmetry forces the line element (3.1.1) to take the form 

dr 2 = dt 2 - a(t) 2 ^ 2 + r 2 dQ 2 ^ (3.2.1) 

where dCl 2 — dQ 2 + d(p 2 cos 2 9 and k € { — 1 , 0 , + 1 } 6 . The different signs of k correspond to an 
open, flat and closed universe, respectively. The metric is known as the Friedmann-Lemaitre- 
Robertson-Walker (FLRW) metric. The function aft) is some so far unspecified function of 
cosmic time t, called the scale factor. To find this function, we must solve the EFE. The source 
must also be maximally symmetric in space, and so takes the form of a perfect fluid 7 , 

T t , v = Pgfiv + (p + p)U fl U v , (3.2.2) 

where p and p are the pressure and energy density of the fluid, and U is the fluid velocity, 
which in the cosmic rest-frame is given by 

U° = 1 
IT = 0 , 


that is to say, the contents of the universe are, on cosmological scales, relatively quiet. Because 
of the high degree of symmetry in the problem, only two independent equations remain of the 
EFE. The first is the Friedman equation, 


a 2 + k 



and the second I take as conservation of energy, and write as 

f(p„ 3 )+ 3 p« 2 = 0 


( 3 - 2 - 3 ) 


( 3 - 2 - 4 ) 


4 Or stated otherwise, the universe is homogeneous and isotropic. 

5 The canonical length scale is 100 Mpc ~ 3 ■ 10 24 m. 

6 Another convention takes a(tg) — 1 and lets k describe the curvature. One can go back and forth by rescaling k, r and a, 
leaving invariant the combination ka~ 2 , the curvature of the space, which is a physical quantity — conventions don't 
affect observables. I find it instructive to keep both explicit. 

7 Fluid in the sense of fluid dynamics. 
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To close the set of equations, we need an equation of state, describing the pressure as a function 
of the energy density 

V = pip) ( 3 - 2 - 5 ) 

Two equations of state are of particular importance. These are of non-relativistic matter, or dust, 
and ultra-relativistic matter, or equivalently, radiation. The two are 

Pmatter P (3-2-6) 

Pradiation p/3 (3.2.7) 


For the two we find, according to Eq. (3.2.4) the dilution of the energy density is 

Pmatter °*- 8 (3-2-8) 

Pradiation 04 ® (3-2-9) 

These factors should not come as a surprise. Thinking in terms of an expanding universe, 
matter is simply spread over greater volumes and dilutes as 1 / V, whereas radiation is not only 
diluted, but also stretched by the expansion. One can in general think of some perfect fluid 
with equation of state 

p = zvp. (3.2.10) 


I will in the following keep the radiation and matter factors explicit, but all calculations can be 
made with arbitrary zv. 8 

With these expression for the energy density, we can in principle solve the Friedmann 
equation. It is customary to rewrite the equation a bit. First introduce the Hubble parameter 
and critical density. 


H = d/a, 
3 H 2 

Pc = 


H 0 = H (today) = 100 /; 


km 

s • Mpc 


8 nG 

Dividing Eq. (3.2.3) through by a 2 we get 

H 2 = H 2 (P- 


A 

m 2 


a 2 H(j 


(3.2.11) 

(3.2.12) 


(3.2.13) 


The density p can now be matter, radiation or both. Taking into account how the two densities 
scale, write 


a 3 a 4 

P — Pm + PR — -f PmO + -jPro 
a 3 a 4 


(3.2.14) 


where «o — a (t = to) is the scale factor today. Now define the density parameters O, as 

Pm 
Pc 


n - P M0 n - P R0 

il m — , Hr — 


n A = 


Pc 

A 

3 W 


f\ = - 


a o H o 


(3-2-15) 


and finally, write the Friedmann equation as 


= H; 


|n m («/«o) 3 + n R (fl/fl 0 ) 4 + n A + n^(fl/fl 0 ) 2 | (3- 2 -i6) 


There are some subtleties in what values of w are physical. I will not address these issues here. 
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Inserting f = to we easily see that the density parameters obey the sum rule 


O m + Oft + + Qfc — 1 (3.2.17) 

Widely accepted, concordance, values for the values of these parameters in the present universe 
are ([13]) 


Cl m ~ 0.3 Qr ~ 0 

n A « 0.7 n t «o 

H 0 » 70 Mpc _ 1 km/s ms 1.3 ■ l(T 41 GeV (3.2.18) 

which is why the current setting is called ACDM. A for a Cosmological Constant, CDM for 
cold dark matter. The actual baryonic matter we are all made of is in this picture a mere 5 %, 
which is included in the fi m here. 

Single component universes For the sake of intuition, let's work through some examples 
of single component universes. In particular, consider the four immediate possibilities — 
matter, radiation, curvature, and Cosmological Constant-dominated universes, with each 
of the four density parameters fi, = 1 and all others 0 . This corresponds to solving the 
equation 


a 

do 

( a \ 

-n/2 . 

f a\ 


~ 

1 =k — = 

~ 

a 

a o 

\«o) 

£Zq 

w 


( 3 - 2 -i 9 ) 


for n G ( 3 , 4 , 2 , 0 }, respectively. Assume now a power-law form, a cx t m . Putting this in 
our equation, we get the condition 


m — —, n 0 (3.2.20) 

n 

For the Cosmological Constant, this solution fails, but we see immediately for n — 0 the 
answer must be an exponential function. For the four different single component universes 
we have the following solutions 

matter dominated (Einstein-de Sitter) 

radiation dominated (3.2.21) 

curvature dominated (Milne) 

Cosmological Constant dominated (de Sitter) 

Finally, extrapolating a —» 0 , we get the following expressions for the age of the universe in 
terms of the present Hubble constant, 


a(t) — ao x < 


a 

a 


\ 2/3 

foj 

tV /2 


to) 

t/t 0 
exp (H 0 t) 


— - - 



matter dominated (Einstein-de Sitter) 
radiation dominated 
curvature dominated (Milne) 
Cosmological Constant dominated (de Sitter) 


(3.2.22) 


Since we observe neither cosmic time, nor the absolute scale factor, it would be nice to have 
a proxy for the two. To this end, we introduce the cosmological redshift, 9 denoted z. This is the 
fractional amount the wavelength of radiation has been stretched by the universe expanding. 
To see how this comes about, place an observer at r — 0 and let a wave crest be emitted at t\ 


9 Not to be confused with the Doppler redshift. 
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propagating radially inwards from some radius r\. For a lightlike test particle, the proper time 
is zero, and we have 


0 = dt 2 



dr 2 
— kr 2 


( 3 - 2 - 2 3 ) 


Call the time it is observed to, we then have the equation 


rt0 dt 

h «(0 


dr 


1 0 \/l — kr 2 Vk 


1 _ 

= —= sin -1 (Vkr-i) 


{3.2.24) 


Notice the sign in taking the square root is fixed by the direction of propagation. The next 
wave crest is emitted shortly after, follows the same path and obeys the same equation but with 
slightly shifted time coordinates, 

CO+INO dt 1 . _j nr 

/ ,, T77T — ~7f sm (vkn), (3.2.25) 

Jt\+\/v\ a \t) yk 

where v, is the frequency at 13. For frequencies much larger than H % 3.2 • 10 18 /;s " 1 we get 

r f 0 dt 


0 = 


rto+t/vo dt ^ 1 1 

/q a(t) Jq+i/vj a(t) ~ a(ti)vi a(to)vo 
N _ fl(fp) 
l/ 0 fl(fl) 

We now define the redshift as the fractional increase in wavelength, 

= A 0 - A1 _YX_\ = _ 1 

Ai vo «(fi) 


(3.2.26) 


{3.2.27) 


This is a nice quantity to work with because it is readily observable through analyses of spectra. 
We can rewrite Eq. (3.2.16) trading t and a for z, giving 


H(z ) 2 - {n m (l + z ) 3 + n R (l + z ) 4 + o A + n k (l + z) 2 } 


( 3 - 2 - 28 ) 


3.3 COSMOGRAPHY 


On cosmological scales, the intuitive notion of distances fails. Depending on the question you 
ask, distances to the same object may differ — by a lot. In this section, I explore the different 
measures of distance and try to clarify their meaning. 

First, let us connect the r coordinate to the physical redshift. Take an observer and an 
emitter, say a galaxy or a supernova, at relative proper distance r\. Emitting a single photon at 
t — t\, we observe it at t = tQ. The photon follows the path described in Eq. (3.2.23), and upon 
inverting Eq. (3.2.24) we get, with a change of variables, 10 


r\ 



(Vk f z dz' \ 

y flo Jo H(z') J ' 


( 3 - 3 -i) 


Usually though, a single photon is not enough. What we might hope to measure is a stream 
of light from a source of known luminosity. Considerations from Euclidian space lead us to 
define the luminosity distance, di as 


F = 


L 

47 id.y 


( 3 - 3 - 2 ) 


10 The following expression holds, by analytic continuation of sin, for all k. 
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where F is the measured flux from an object of luminosity L. Now we seek the relation between 
this definition and the proper distance — and hence the redshift. Note that F and L are 
bolometric quantities, ie. integrated over all frequencies. First, consider the area over which the 
emitted light is spread. Integrating the angular part of the metric, we get a total area, at time to 
— when the light is observed 


A = 4 na(t 0 )r 1 . 


( 3 - 3 - 3 ) 


Travelling across the universe has its price, though. First, the emitted light is redshifted, which 
reduces the energy per observed photon by one factor (1 + z), and second, the distance between 
individual photons is increased, also by a factor (1 +z). This means the observed flux is 
reduced by a total factor (1 + z) 2 , giving 


F = 


47ra(fo) 2 r 2 (l +z) 


d L = (l + z)«(to)ri. 


( 3 - 3 - 4 ) 


Next we look at an object or a feature, which is extended across the sky in some angle 
89 <C 1 at proper distance jq. Looking again to Euclidean geometry, we expect the measured 
angle to be the length of the object, D, divided by the distance d A , 

se = ( 3 - 3 - 5 ) 

d A 

To find the relation between the angular diameter distance and the proper distance, we arrange 
our coordinate system appropriately and integrate only 9 in the metric. Doing this we get that 
the proper distance between the two ends of the object at t\ is 

D = a(ti)riS 9 =>• d A — a(ti)r 1 = (1 + z) _1 a(f 0 )fq- (3-3-6) 


An equivalent definition in terms the solid angle SCI, filled by an object of proper area 5 A is 


d A — 



( 3 - 3 - 7 ) 


The transverse comoving distance is defined as the ratio of the proper transverse motion of 
a particle to the angular motion we see 


dM 


AD / 8t\ 
89/8 to 


d A ( 1 +z) = fl 0 D- 


( 3 - 3 - 8 ) 


Note that it is not, as the angular diameter distance, the physical length of an object. 


Curved space To gain a bit of intuition for curved space, consider measuring dM- 
Without looking to the equations, we ask ourselves "are we going to measure more or less 
than we think?". Recall that in positively curved space, parallel lines get closer and closer, 
while in negatively curved space, they grow further apart — the first point is nwst easily seen 
by imagining a 2-sphere, where lines that are parallel at and orthogonal to the equator will 
intersect at the poles. Nozv, we observe some angle, which is to say at our position, the two 
lines going to each of the tzvo sources we observe have some incident angle at our position. 
As zvejust argued, the separation between tzvo lines changes in curved space compared to flat 
space. This means that in positively curved space, the tzvo lines going to the tzvo sources will 
get closer as they go along, and the distance dM is smaller than in flat space. Conversely, in 
negatively curved space, the lines get further apart and dM is larger, see Fig. 10. This effect 
is exactly the effect of the sin function in the expression Eq. (3.3.1). 
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Figure 10: Sketch of lines with equal incident angle at the observer point, propagating in 
differently curved spaces. From outside in, the universes have k = — 1 , 0 , + 1 . This shows how 
the distance measured is affected by the curvature of space, as the length between the lines at 
the top — at the source position — is changed by the warping of the geodesics. 


We see that the angular diameter distance and luminosity distance are not independent 
from the proper distance — they satisfy the Etherington reciprocity relation, 


d A (l + z) = d M = d L (l + z) 1 


( 3 - 3 - 9 ) 


We finally want to know what part of the universe can ever have had an effect at our position, 
given that the current dynamics are what have always been at play. That is, at a given time in 
the history of the universe, how big was the causally connected part. The proper distance to 
this horizon is just the integral of the square root of the radial part of the metric 


d H — 



dr 

\J\ — kr 2 



(3.3.10) 


The horizon problem Consider a matter-dominated universe, which for the following 
calculation will simulate the universe we live in. Calcidating the distance to the horizon is 
straight-forward, and we get 


d 


H,matter — 


2 

Ho 


(l+z)- 3/2 . 


(3.3.11) 


Watching this horizon on the sky from far away, we expect that any two points further apart 
than dn wzZZ not he in causal contact — and will not a priori know anything about one 
another. Let's calculate the size on the sky of such a horizon patch. The angular diameter 
distance is in the matter dominated universe given by 


d 


A,matter 


2 l-(l+z)— 1/2 
Ho 1 +z 


(3.3.12) 


The angular size of the patch for a given redshift is then 


56 — dn/d A 


( 3 - 3 - 13 ) 


The cosmic microwave background (CMB) radiation is the leftover thermal bath of photons 
from the early universe. The photons decoupled at redshift 1 +zfs 1100 and have been free 
streaming since then. The size of a horizon patch at this decoupling redshift is 

1100 _1/2 

56 «-, ,, = 0 . 031 rad = 1 . 8 ° (3.3.14) 

1 - 1100-H2 u J ^ 

Note that this is significantly less than 180 °. This means that patches on the sky separated 
by more than 1.8° should be completely independent — the exact number changes slightly for 
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different universes, but the point remains. The great surprise is the fact that the temperature 
of these photons is to very high precision constant over the whole sky. This means that 
apparently, the entire observed universe has been in causal constant at some point, yet 
our calculations show, that given the present expansion of the universe, there is no way it 
could have been. This is known as the horizon problem. One solution to this problem — 
inflation — is to insert a sudden de Sitter period, which blows up the horizon, while keeping 
the Hubble parameter constant. This can leave the observed universe inside the horizon 
distance. The problem is of course formulated assuming the metric description holds to t = 0 , 
in order that the integral (3.3.10) can be calcidated. 


For distances at low redshift, z€l, we can expand the expressions previous and do the 
integrals. I will illustrate this with the luminosity distance, and on the way introduce the 
deceleration parameter. Take the expression Eq. (3.3.4) and taylor expand the integral of in r\. 
Since sin(x) = x + 0 (x 3 ), we can get to second order in z while just considering the expression 


H 0 d L =(l+z) 


= (l+z) 


dz f 


Jo '\/O m (l + + (1 — Q m — Qy\)(l + z)^ 

— y (1 — qo) + £>(z 3 )^ , (}o = n m /2-n A 


(3-3-15) 


where qo is the deceleration parameter, and I have ignored radiation, as is justified in the late 
universe. This measures the degree to which the universe is decelerating — named so since 
historically it was believed the universe was decelerating and positive numbers are pleasing. 
Let's see how the acceleration of the FRW universe is related to this parameter. We turn to the 
expansion of the scale factor around the present time, t — to -C H 0 1 , 

= a o +do(t — to) + -^-(t — to) 2 + 0(t 3 ) 

— a 0 + [1 ~ t 0 ]H 0 - - ~ f o]H 0 )- + 0 (t 3 )^j (3-3 -i6 ) 


The coefficient in front of the second order term is just what we're looking for. To see this, 
reorder and differentiate the Friedmann equation, (3.2.16) with respect to time. 


^d = ^H q a^J Cl m (a/a 0 )~ 3 + O a + Cl k (a/a 0 ) 
=>« = H 0 d ^d m {a/a 0 )- 3 + Ct A + Ct k (a/a 0 y 
— 30 m (fl/flo )~ 3 — 2 Cl k (a/a 0 )~ 2 \ 


+ 


2 sJCl m {a/ao)~ 3 + O a + Cl k (a/a Q )- 2 
—2 — Ft/// + 2 Q a 


«o = Hodo 1 + 


ao a o 
-do - -rr- 


^ — Hodo (^m/2 F1 a) 


( 3 - 3 - 17 ) 


We can see qo as a scale-free measure of the deceleration of the universe — the scale of expansion 
is set by Ho and the scale of the universe by flo- Note that qo only describes the deceleration of 
the universe today. Generally, q changes throughout the course of the universe. Only in very 
special cases the universe is forever non-accelerating. 


3.3.1 Moving emitter and observers 

The Doppler effect, being a well established phenomenon, also has to be taken into account 
when measuring the universe. Typical peculiar velocities of galaxies, which is to say the velocity 
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in excess of the Hubble recession, are expected to be of the order a few hundred kilometers per 
second, ie. v ss 10 ' 3 in units of the speed of light. This includes both us as observers and eg. 
SNe as emitters. The problem I wish to address in the present subsection is what difference 
this makes to the light we receive. Now, since these velocities are only mildly relativistic, we 
shall only look at the first term in an expansion around zero velocity. The following derivation 
follows the work in [14]. 11 

First we realise that the redshift we see is not only redshifted by the expanding universe, but 
also by normal relativistic Doppler shifting. Denoting the expected redshift in a completely still 
universe by z as before, we write z for the corrected redshift in the universe where everyone is 
moving around. By normal Doppler shifting, z is given by 

1 + z — (1 +z)(l +n ■ [v e - Vo\) + 0 (v 2 ) (3.3.18) 

where the 1 are the velocities of the emitter and observer, respectively, and n is a unit 
vector point from the observer to the emitter. From here onwards, anything but the first Vj 
term is neglected. Now, beaming effects also come into play, in particular the solid angle of the 
emitter is changed by relativistic beaming as 

<50 —> ( 50(1 — 2 n ■ Vo) ( 3 - 3 - 1 9 ) 


Note that this only depends on the observer-velocity, not the emitter. This changes the angular 
diameter distance, Eq. (3.3.7), which we can in turn link to the luminosity distance through 
Eq. (3.3.9). We see that the changes are the following 

d A (z) = d A (z)(l + n ■ Vo) (3-3- 2 o) 

_ (1 H - Z 

=^d L (z) — d L (z)(l + n ■ v 0 )j— — y2 « d L (z)(l + n ■ [2v e -v 0 ]) (3-3- 2 i) 

(1 + z) 


We are still not done yet, as this last equation does not relate directly observable quantities. The 
redshift we observe is naturally z, so we will have to also evaluate di at this slightly shifted 
redshift. What I will do is a simple Taylor expansion of the function. This means we take 

d L (z) = d L ( z) + M ^ (z - z), (3-3- 22 ) 

where we can write z — z = — (1 + z)n ■ [v e — v 0 ], and so we just miss the derivative. From 
Eq. (3.3.4) we get 


dd L (z) _ d L (z) 1 +z 


cosh 


dz 1 +z H(z) 

Putting this into the former expression we finally have 


Vci^dc/ du 


(1 “h z)2 

d L (z,n) = d L ( z) [l-n-Ve] -cosh 


Vn k d c (z)/d H n-(v e -v 0 ) 


W=o . ri 1 (l + z ) 2 / \ 

-> d L (z) [1-n- v e \ - n ■ (v e - v 0 ) 


(3.3.23) 


( 3 - 3 - 2 4 ) 
( 3 - 3 - 2 5 ) 


Now, the random movement of emitters will induce an uncertainty of this sort. I return to 
this point later. This also means that since the Earth is not completely still in the universe, we 
will have to correct for this effect. This movement of the Earth can be estimated by assuming 
there is no intrinsic cosmic dipole in the CMB, and then looking at how big the observed dipole 
is. This dipole must then be the result of a doppler shift, from which one deduces the velocity 
^Earth through space ~ 369 km/s, ([15, 16]). Since this is a constant effect, it is usually subtracted 
from data sets before publication. 

Effects of this kind in relation to SNe have been addressed in eg. [14,17-20] regarding both 
uncertainty estimation and direct searches for bulk flows. 


11 


Note that this particular article follows a different notation — the bars are non-bars here and vice versa — and only 
treats flat space. 
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3.4 THE COSMOLOGICAL CONSTANT 

I will now devote a single section to an unfairly brief discussion of a problem, whose formulation 
is maybe more subtle than its answer. The assumed detection of a Cosmological Constant of 
order Hq, ie. Q,\ = 0 (1) is puzzling for many reasons. I will try to sum up the problem, and 
refer to some of many reviews on the subject for a deeper analysis, see eg. [21-23] an d their 
many references. 

The first issue with this particular problem is, that it is not immediate what the problem 
actually is. We have measured some value of a particular constant in our theory, namely A in 
ACDM — and so what? The first question we might ask ourselves is, why 0 ( 1 )? How come 
that the Cosmological Constant A knows about the hubble scale today, and is just about that 
value. Now, this may just be a coincidence. 12 Even if it were, things are not this simple. 

The value of the hubble constant, as we saw in Eq. (3.2.18) is very small compared to energies 
of eg. masses of standard model particles, m e « 10 °’GeV for the electron to heavier particles 
like the Higgs, which is m# ~ lOOGeV. Now, the masses of particles come in since the EFE 
have both a left and right hand side. The bare Cosmological Constant is the term A on the left. 
But the stress energy tensor, when considering a quantum field theory living in your theory 
of gravity, gets vacuum contributions, which we might denote (P' v ). By Lorenz invariance of 
the vacuum, this contribution must be of the form — Pvacmimg 1 "' — it looks exactly like the A 
term. Now the problem is not just that the Cosmological Constant has a peculiar value, but that 
two distinct physical effects cancel such as to make the sum A + 8 nGp ~ Hq. To see why this 
seems unreasonable, we have to look at the natural sizes of the individual terms. 

The classic tale of p is vacuum fluctuations of the Standard Model fields. As free fields in a 
quantum field theory are quantized as an infinite sum of harmonic oscillators, for which the 
zero-point energy is (v/2, the zero-point energy of a single field is in some sense the sum of 
these individual terms. As this sum of course diverges, one may be inclined to put in a cut off, 
with the argument that eg. we don't know what happens above the Planck scale Ep, and so the sum 
only goes to energies of order Ep « 10 18 GeV. This naive argument gives vacuum contributions 
of the order p « E 4 « 10 72 GeV 4 — one power from the energy of the oscillators, and one 
power from each of the spatial dimensions we integrate over. This is to be compared with the 
energy 'density' of the A term, which is about the critical value. Pa ~ Pc ~ 10 ~ 46 GeV 4 . The 
discrepancy between these two numbers is the famous 72 — (— 46 ) « 120 orders of magnitude 
between theory and observation. 

There is however a flaw in our previous derivation. We introduced an energy cutoff, which 
explicitly breaks Lorentz invariance — yet we are trying to calculate a manifestly Lorentz 
invariant quantity. This is not so good. Doing the calculation more carefully also shows that 
what we did before would lead to an equation of state w = +1 / 3 . It looks like radiation! This 
is nothing like what we want. It is immediate that we have to abandon the sharp cutoff. What 
we must do is find a Lorentz invariant way to get rid of the UV — the high energy modes, 
which we do not know exactly how behave. Taking a clue from particle physics, we can do 
dimensional regularization. This is doing the calculation in a general dimension, d. Of course 
the original answer will still diverge, but doing the calculation like this, we can exactly see 
where and how the infinities occur. That means we can meaningfully subtract an infinity from 
our result to get something observable. Doing this calculation, we get that it is not the cutoff to 
the fourth power, but the mass of the individual fields to the fourth power, summed, up to some 
constants. 


12 There is a related problem, called the coincidence problem. This is the observation that living in a universe with 
comparable matter and dark energy densities, O m ss Ha, seems somewhat unlikely. Extrapolating back to, say 
recombination, the matter density has since then been diluted by a factor rs 10 9 , while the dark energy density is forever 
fixed. Yet just now, when we are here, they are almost equal. See [ 24 ] for a more precise definition and discussion of 
this problem. 
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But we're still not done. Another term contributing to the vacuum energy density is the 
zero point of any potential of any particle. There's only one obvious one in the Standard 
Model, which is the Higgs potential — the, now famous, Mexican hat. This has a peculiar effect 
attached to it, since its zero point is different in the past, very hot universe and in the present, 
cold universe. Namely, when the universe is very hot, the potential does not actually look like a 
mexican hat, but like a normal cp 2 potential because of thermal effects. This in turn means that 
the difference between the potential energies of the vacuum before and after the phase transition 
is mfj/( 4 A), where /«// is the Higgs mass and A is the Higgs self coupling. If we interpret the 
potential energy as contributing to the vacuum energy density, this means that either before or 
after, we are going to have a massive contribution from the Higgs potential. A similar thing 
happens when chiral symmetry in QCD 13 is spontaneously broken [25]. Inserting standard 
model values for these quantities, we get 

I PEW phase transition I ~ 10 GeV (3'4'-0 

IPQCD phase transition| ~ 10 GeV (3*4*^) 

Collecting all the terms so far lands us at ([26]), 


Pvacuum ~ dr | Pew phase transition I dr | PQCD phase transition I 

4 

7777 

+ PAd- E ^ 6 * 7 ? (3 ' 4 ' 3) 

SM field degrees of freedom 

where the dr show that there is no a priori preference for what should be the zero point of the 
phase transition energies, and the minus in the sum is only there for fermion fields. Since the 
top is so heavy, this sum evaluates to something negative of the order fields P ~ — 10 8 GeV 4 . 
Thus, the problem has been ameliorated a bit from the initial 120 orders of magnitude fine 
tuning to a mere 8 — (— 46 ) = 54 orders of magnitude. Fine tuning here means that we have at 
least the four terms in Eq. (3.4.3), maybe more, all of which are very big, and cancel, apparently 
not exactly, to 54 decimal places, to give us the value fL\ « 1 today. A very long explanation of 
all this is found in [23]. 

This is the Cosmological Constant problem. The apparent almost-cancellation to an unrea¬ 
sonable number of decimal places of quantities that should know nothing about one-another — 
eg. why would the Higgs potential know what the hubble scale is, and why would an arbitrary 
constant, the Cosmological Constant, know what the top-mass is? 

3.5 ALTERNATIVE VIEWS 

The story of cosmology in text books is fairly straight forward. Here I want to present 
some views opposing the very optimistic approach of the perturbed FLRW metric as a valid 
description for the entire universe. I hope to summarise the idea behind some select points of 
view in recent literature, but this is by no means meant as even a fair introduction to the subjects, 
each of which could have been the subject of an entire thesis. As such, I will be skipping 
technical details, and simply appeal to the idea behind and intuition about the approaches. The 
nature of the different subjects varies a lot, from changing gravity itself to doing more careful 
studies of the existing gravity, and the nearby universe. 

Because of the large and ever increasing number of cosmological datasets, there is a host of 
constraints on any model. I will mostly address issues regarding supernovae, while reminding 
that other non-trivial constraints exists. 


13 Quantum Chromo Dynamics, the theory of quarks, gluons and their color interactions. 




38 


COSMOLOGY 


3.5.1 Changing gravity 

To see the start of this approach, we have to reformulate the derivation of the EFE a bit. As 
it turns out/ 4 the field equations can be found from the principle of least action, given the 
Lagrange density 


L = 


1 


167 iG 


R 


( 3 - 5 -i) 


We then define the action as S — f d 4 x x /—gL, and the sourceless EFE follow from requiring 


SS 

hv-v 


= 0 


( 3 - 5 - 2 ) 


By adding a matter term L m to Eq. (3.5.1) we get the sourced EFE when we identify T- lv — 
— 28 L m /Sgpy + gt n 'L m . We may also add the constant A with proper normalisation, which is 
the Cosmological Constant. This means we get the total Lagrange density 


T _ R — 2 A | r 

L — ^77 — r L m 

16nG 


( 3 - 5 - 3 ) 


Now inspired by the effective field theory approach of particle physics, we simply consider 
adding more R-like terms to the action. Without specifying further, we just have some function 
of R, and we have the lagrange density 


L = 


1 


167 tG 


f(R) + U 


( 3 - 5 - 4 ) 


from which these kinds of theories derive their name f(R)-gravity. This is fundamentally 
changing gravity. Without some great insight, all we are now left with is fitting not just L m 
and A, but also the infinite dimensional function /(R), which may or may not be parametrised 
in some way. In particular the FLRW metric is still viable, and so this really extends the 
Cosmological Constant. Note how in the above Lagrange density, A has been absorbed as the 
constant part of /(R). 

An interesting observation is that Starobinsky inflation ([27]) is an f(R) extension 15 , which 
— although having its own problems — solves problems related to inflation. 

Of course, constraints on deviations from general relativity are tight, see eg. [28], so 
constraints on reasonable functions /(R) are too. For a comprehensive review of these theories 
see eg. [29]. 


3.5.2 Averaging problem 

The following approach questions what it means that the universe is homogeneous and 
isotropic on average [30]. The first problem becomes the actual averaging process. It turns out 
that averaging anything but scalars is a problem, since in general the average of a tensor field 
does not transform as a tensor. What was started in [30] was the study of averaged scalar fields, 
in particular the matter density of the universe. One starts by defining the spatial average of a 
scalar field over a particular region V of the universe as 

(Y (x,t)) = y Vdeth d 3 xY (3-5-5) 

14 I will not do the computation, which is messy and not very enlightening. I instead refer to eg. [ 10 ] for a thorough 
walkthrough of the results. 

15 Although it is hidden away in the original article — the R 2 term is put into the Tl w . 
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where h is the spatial part of the metric and the volume is given by 

V(t) — [ V det h d 3 x (3-5-6) 

Jv 

This allows us to define an effective scale factor for V as 



One can now average eg. the Friedmann equations with no Cosmological Constant, which gives 

On comparison with Eq. (3.2.13), we recognise both the density and 1 Z term, which is disguised 
as k, but also notice the appearance of a new term Qy, which in the simplest case is defined as 16 

Qv — 6((H 2 } — (H) 2 ) (3.5.9) 

Ie. Qy is a measure of the inhomogeneity of the expansion of space, and this term feeds into 
the Friedmann equation. As it turns out, looking at the equation of state of this new term 17 we 
find that it behaves just like a Cosmological Constant, wq — — 1 . Furthermore, it also feeds into 
the sum rule of Eq. (3.2.17). These points mean that neglecting this term naturally leads to a 
biased parameter estimation. 

A nice feature of this approach is that the beginning of cosmological acceleration in the 
FLRW sense seems to coincide with structure formation. This has an immediate interpretation 
in this formalism, since now the inferred acceleration is linked to the inhomogeneous nature of 
the universe, [32]. 

3.5.3 Exact inhomogeneous spacetimes 

The Lemaitre-Tolman-Bondi (LTB) metric is an exact general solution to the same questions 
as the FLRW metric was, except homogeneity, as found very early in [33] and later again in 
[34]. Inspired by the isotropy of the CMB, this was first rediscovered as a physical model of 
cosmology in [35]. Actual fitting of mass profiles to various datasets, including SNe, has been 
carried out in eg. [36], which also introduces the various concepts I use below in a simple way. 
This approach abandons the exact cosmological principle and suggests that our immediate 
neighbourhood does not have the same density as the rest of the universe, eg. we could be 
living in an underdensity. 

The LTB metric is given by, when molded to a suggestive FLRW-like form, 

ds 2 = -dt 2 + dr 2 + A 2 (r, t)dCl 2 , ( 3 - 5 -io) 

where A' = Comparing to Eq. (3.2.1), we notice that putting A — a(t)r and K — —r 2 k, we 
obtain again the FLRW metric, which is of course a special case of the LTB metric. 

We can again derive a Friedmann-like equation for this spacetime, by putting in a suitable 
matter term in the EFE, and we get 

(-J =H 0 {n m (A/A 0 )- 3 + n K (A/A 0 )- 2 } (3.5.11) 

16 In the interest of intuition, I am skipping a lot of definitions, in particular here is a slight abuse of the original notation. 
I use here ©j! = 3 H instead of (©j)} = 3 H. 

17 It was found in [ 31 ] that this can also be interpreted as a scalar field called the morphon. 
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for some reasonable definitions of O, — which of course reduce to the versions we already 
saw in the homogeneous limit. What we are now left to do is determine the properties of the 
various functions involved. In particular determining A, which can be thought of as a spatially 
varying scale factor. 

The intuitive picture of how an inhomogeneous universe might resemble a universe with a 
Cosmological Constant can be thought of as follows. What was initially claimed in the SN data 
was that the far-away SNe were fainter than what was predicted in cosmologies with no A, ie. 
they were further away. This was interpreted as a recent onset of acceleration of the expansion 
rate, which in FLRW can only be explained by a A term. In an inhomogeneous universe, this 
accelerated expansion is instead explained as the far away universe simply not having the same 
matter densities as the nearby one. This makes it possible to have different expansion rates at 
equal times in the universe without invoking a Cosmological Constant. 

There have been arguments over the physical validity of the Earth being the centre of the 
universe, when taking the zero-point of the LTB coordinates to be us, here, see eg. [37, 38]. This 
point though, is taking the LTB too literally, [39, 40]. In its form here, it should still be thought 
of as an approximation to what is really going on. This includes the immediate idea that we 
are, most likely, not the center of the universe. It might be the case in some average sense, that 
an inhomogeneous metric captures the real world better than a perfectly symmetric one, [41]. 

Other exact solutions exist, like the Szekeres model, [42, 43] and more contrived examples 
like patching together FLRW- and LTB-metrics in a kind of Swiss cheese model, [44]. 

3.5.4 Darkfloiv 

Everything we cannot immediately explain the origin of is called dark. Dark matter is also 
dark because we have no evidence that it interacts with light, but dark energy is simply dark 
because we have no idea what it is. In the same way, there have been claims that there exists 
an unexplained large scale bulk flow — a dark flow — of the nearby universe, see eg. [45, 46]. 
The first problem is explaining such a large bulk flow in what is supposed to be a very still — 
maximally symmetric — spacetime. Assuming this is done, the presence of the dark flow may 
mimic cosmic acceleration, [47, 48]. 

The argument is, that the observed acceleration, originally parametrised by qo, is affected by 
a large bulk flow. First of all, one realises that the apparent hubble constant changes according 
to the size and magnitude of the bulk flow. This allows one to write the deceleration parameter 
in the dark flowing frame — in which we are supposed to reside. Supposing the universe is 
only non relativistic matter, the global deceleration parameter is qo — fi,„/2, while the local 
deceleration parameter takes the following form 

1 +?0 = (1 + n m / 2 ) ^1 + —^ ^ + 34P ) ( 3 - 5 - 12 ) 

where 6 is a measure of the bulk flow. The difference between dots and primes is a change 
of frame, but are both time derivatives. The problem now becomes translating from bulk 
velocities to these quantities, especially 6 and 6 . In [48], using the most optimistic values for 
these parameters, one gets a change in the deceleration parameter as one goes from the still 
frame to the bulk flowing one as large as — 0 . 3 . More conservative estimates diminish this by 
about an order of magnitude. Even if one is just interested in vanilla ACDM, this is an effect 
one cannot neglect in a proclaimed era of precision cosmology 


SUPERNOVAE 



Studying supernovae (SNe) dates back hundreds, if not thousands of years. If one is lucky, as 
were the Chinese and Tycho Brahe at different times in history, these exploding stars can be 
seen as clearly as every other star. Indeed the one observed by Tycho was called Stella Nova, 
the new star. The first systematic attempt of scientific research with these stars was done at 
the Palomar observatory, [49]. Already early on, the SNe were split into different types, I and 
II, depending on their primary element abundances. Later, with more data, the types la, lb 
and Ic were distiguished, and today many more subclassifications exist, lax, Iln and IIP, IIL. 
See eg. [50] for more on the classification and [51] for the new lax class. For a history of SNIa 
observations, see eg. [52]. 

What will be the subject of the present section is strictly Type la SNe for cosmological 
purposes. This class was early on seen to be relatively homogeneous, ie. their absolute 
luminosities are very similar. Having a standard luminosity would mean one could map out the 
distances in the universe using the relations derived in Sec. 3.3. This can easily be understood. 
Take a Euclidean, flat, space and scatter, say 60 W lightbulbs in it. Measuring the flux, F, from a 
bulb, we easily find the distance to it using F — L/(And 2 ). This luminosity distance is exactly 
what we found an expression for in Eq. (3.3.2) for a general spacetime. Obviously, there are a 
host of complications in this procedure, and here I want to illuminate some problems and their 
proposed solutions. 

4.1 SUPERNOVA PROGENITORS 

I will not try to review the history of stars here, but simply state that when stars such as our 
Sun, a so-called main sequence hydrogen burning star, ends its life, it becomes a white dwarf [53]. 
The main point we shall consider about white dwarfs is that they are supported against gravity 
mainly by electron degeneracy pressure. This is a pressure coming from Pauli's exclusion 
principle — two or more electrons cannot be in the same state, and so if we squeeze electrons 
enough, they will fight back. Detailed calculation of a degenerate electron gas shows that the 
radius of a white dwarf shrinks as we put more mass in it. This leads us to the limit beyond 
which the degeneracy pressure cannot support the star — this is called the Chandrasekar limit 
[54]. The numerical value depends on the distribution of mass in the star and the ionisation 
degree in the gas, and is around 

Me ~ 2.86 ■ 10 30 kg « 1.44 sun masses (4.1.1) 

These white dwarfs are what we think is going to be type la SNe. The leadup to the SN 
explosion is still uncertain, but one story, for which [55] recently found concrete evidence, goes 
as follows. Take a system with one white dwarf and another star. The white dwarf may now, 
over time, suck in matter from the companion star. This only continues as long as the white 
dwarf is stable — at most until the Chandrasekar limit, at which point the white dwarf heats 
up, collapses and initiates a thermonuclear reaction, releasing more than enough energy to 
blow the white dwarf apart. 

The main point of the story is that we have reason to believe that all SNe of this type came 
from stars of about equal masses. Now if all the SNe have similar boundary conditions and 
similar evolutions, then we might expect that these can be used as our standard 60 W bulbs in 
the universe, [56]. 1 

1 Of course, this is not the only possible scenario for the progenitor of SNe, and different scenarios might lead to different 
energy outputs and evolutions. This is an immensely important point, which I will neglect for the main part of the 
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The result of this violent explosion is a lot of highly radioactive material flung in every 
direction. 2 The light from the radioactive decay of this debris is what we observe, and is what 
contests entire galaxies in luminosity 

4.2 OBSERVING SUPERNOVAE 

Once the hurdle of actually finding SNe is overcome, 3 the observation is a timeseries of 
photometric measurements, ie. fluxes through various colour filters. These timeseries are called 
lightcurves. The classic photometric system is the Johnson-Cousins or UBV (ultraviolet, blue, 
visual) system [58]. Many more systems exist today, see eg. [59]. Such a photometric system is a 
series of window functions on the allowed frequencies/wavelengths of the observed light. An 
example of such an observation in an extended system is shown in Fig. 11. 



Figure n: Optical and near-infrared lightcurves of SN 2007af from the Carnegie Supernova 
Project. The mean wavelength of the bandpasses ranges from 350 nm (u band) to 1600 nm (H 
band). The y-axis is in apparent magnitudes, and the time axis is shifted so day 0 is at maximum 
B band brightness. Figure is taken from [60]. 


Now to do cosmological studies, we need a way to convert from the observed fluxes to 
distance measurements. To do this, we need three conventional things: a flux, a luminosity 
and a distance. These three quantities naturally must obey Eq. (3.3.2). Measuring another flux, 
as we do, and assuming this comes from a standard candle, ie. a class of sources of equal 
luminosity, as we hope SNe are, we have the following relationship. 


F/Fref = 1 

F/Fref (dl/^L, ref)“ 


(4.2.1) 


analysis. Different unidentified classes of SNe la could indeed bias the results of an analysis, which does not identify 
them as such [ 57 ]. 

Although not necessarily isotropically! 

This point is naturally very non-trivial, but will not concern us too much. 
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We now define the apparent magnitude m — — 2 . 51 og 10 F/F re f and absolute magnitude M — 
— 2.5 log 10 L/L re f, which gives us the expression 


fi — m — M — 5 log 10 


dp 

dp,ret 


( 4 - 2 - 2 ) 


For historical reasons, di re f = lOpc. p is called the distance modulus. It is now apparent that if 
M is the same for all SNe, then measurements of the flux are directly linked to the luminosity 
distance, which reveals to us the expansion history of the universe. Of course we don't expect 
the intrinsic scatter in the luminosity to be exactly zero, even if the progenitor scenarios are 
similar. The earliest observations of SNe were too scattered for a good cosmological study. But 
they did show a remarkable feature — the width of the lightcurve was tightly linked to the 
absolute magnitude. This effect is known as the Phillips relation [6i], and the very first plot used 
just 9 observations, see Fig. 12. Later, Tripp [62] found another correlation with the colours 
of the SNe. These two quantities are marked in Fig. 11. The hope is still today that one will 
be able to reduce the scatter in the Hubble diagram even further by finding more observables 
with significant correlation to the Hubble residuals, see eg. [63, 64] for some examples of this. 
Taking the corrections of the absolute magnitude to be linear in the new observables — as is 
approximately observed in Fig. 12 and corroborated in later studies, [65] — we have for the 
two-parameter model, writing in modern notation X\ for the shape of the lightcurve and c for 
the colour correction. 


p = m — M —s- m — (M — otx\ + / 5 c), (4.2.3) 

where a, /3 are unknown coefficients — we still have no good theoretical model of what these 
should be. This parametrisation of the corrections is the one used by the SALT (Spectral Adaptive 
Lightcurve Template) analysis, see [65, 66] for detail about the fitting and the exact meaning of the 
parameters X\, c. In short, higher X\ are broader lightcurves, and higher c are redder colours. 

This means that we now see the SNe as standardisable candles, their luminosity can be 
corrected to be more or less standard (we will see later to what degree they actually are). 4 
Naturally, these measurements come with some uncertainty due to experimental noise. This 
means that our dataset of the maximum B band apparent magnitude, shape and colour 
correction, (m|,Xi,c) comes with some covariance matrix. One part of this is the statistical 
uncertainty — the noise — and the other is various systematic uncertainties. Determining these 
uncertainties is a big part of any analysis. This also shows up in how surveys are done, since 
early time coverage of the SNe is important to precisely determine the parameters, especially 
the width of the lightcurve. This means that ideally one wants to take pictures of the sky where 
there is no SN, since if there is going to be one in ten days, you want to see the very early time 
light — notice how in Fig. 11 the observations start about ten days before maximum brightness. 

4.2.1 K-correction 

When we derived the luminosity distance, we considered the bolometric luminosity, ie. the 
luminosity integrated over all colours of the light. Yet, when calculating the distance estimate 
we only have light in certain bands. Now, redshifting the spectrum, but keeping the filters fixed 
— for obvious reasons — we introduce a redshift dependent bias in the distance estimate, given 

by ([70, 71]) 


K = 2 . 51 og 10 (l +z) +log 10 


( I F(A)S(A) dA \ 
(f F(A/[l+z])S(A) dA/ ' 


(4.2.4) 


4 Other parametrisations exist, most prominently the MLCS(2k2) (Multicolor Light Curve Shapes) [67, 68] and the newer 
SiFTO [69]. A more complete list of older methods is found in the introduction of [69]. 
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Figure 12: The plot from [61], where Phillips noticed the trend that as the lightcurves get 
wider, the SNe are brighter. The trend is present in all three bands considered here, but most 
prominent in the B band. The x axis, is the decline in B band apparent magnitude 

after 15 days, ie. small numbers correspond to wide lightcurves. The y axis shows the derived 
maximum absolute magnitudes in different bands. 
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where F(A) is the differential flux and S(A) is the filter response. Now, given F, we can correct 
for this by taking instead of the bolometric absolute magnitude, 

M^M + K (4.2.5) 

This is usually done beforehand in data releases, and so we do not need to worry about it. 


4.3 COMPARISON WITH THE COSMOS 


As calculated in the previous section, we find for every SN a distance modulus, which we want 
to compare to our cosmological model. From Eq. (4.2.2) and (3.3.4) we see that the expected 
distance modulus is 


ji c = 51og 10 (d L /10pc) = 51og 10 


(1+Z) sinh 


+25 — 5 log 10 h + 5 log 10 


H 0 dz' 

0 H(z) _ 

c 

lOOkm/s 


(4-3-i) 


where the last log evaluates to ~ 3.477. The take-away here is that when comparing this with 
Eq. (4.2.3), both contain unknown constants. The measurement has the absolute magnitude, and 
the theoretical expression contains the Hubble constant. Importantly, these combine to a single 
constant which factorises completely from the rest of the expression when taking the difference 
— as we shall do later. That means we cannot with SNe alone constrain either of these! What 
one does is to set the Hubble constant to something reasonable, eg. h — 0.7, 5 and then fit the 
absolute magnitude. It is important to remember though, that without a direct measurement of 
the absolute magnitude or the Hubble constant, we can't break this degeneracy. 6 

Let's now go through some of the cosmological effects adding uncertainty to the measure¬ 
ments. 


4.3.1 Peculiar velocities 

Deriving the luminosity distance, we assumed that the source was stationary. Using the results 
from Sec. 3.3.1, we may estimate the error we commit when doing this. From independent 
measurements, we estimate the variance of the isotropic velocity field to be about cr v « 
150km/s. 7 By Eq. (3.3.18), this leads to a redshift uncertainty, given in terms of the variance of 
the peculiar velocity along the line of sight. This is of course just <J V , and so 

^z.pecvel — (1 4 ” Z)a v (4.3.2) 

The (1 + z) term here is usually neglected with the reason that these errors are important only 
at low z, when the term is small anyway. Now we just need to convert this redshift uncertainty 
to an uncertainty in the distance modulus. This is computed approximately as 

_ d}i _ 5 dd L 1 

Vpecvel “ Wpecvel ^ - W,pecvel log ( 10 ) dz H ' 3 ' 3j 

The usual procedure now is to take some cosmology and calculate -ip- explicitly. Which 
cosmology one choses doesn't matter much, since they are all similar at low z where the error 

This choice is manifestly arbitrary, and can always be changed after the analysis. 

What has been refined for decades is the so-called distance ladder. This is a series of classes of objects, each with its 
own defining feature which allows determination of the distance to it. Every class is then a rung on the distance ladder, 
see eg. [12]. Putting the SN observations on this ladder allows one to determine the Hubble constant and in turn the 
absolute magnitude. 

Isotropy here is a rather rough approximation, since this is extrapolated down to small scales — another method takes 
into account the correlations in the velocity field, see eg. [20]. Here I just aim to illustrate the physics behind the effect. 
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is important, [20]. So let us choose the empty universe, fi m = (t,\ = 0 , which has luminosity 
distance 


1 z 

di, empty = sinh (log(l + z)) 

dd L 1/1 1 \ 

^ dz d L \1 +z (1 +z)tanhlog(l + z)) 

Evaluating tanhlog(l +z) — z(z + 2)/[2 + z(z + 2 )] reduces Eq. (4.3.3) to 

5 ( 1 +z \ 

Vpecvel » ^pecvel log(10) ^(4 +z/2 ) J 


( 4 - 34 ) 

( 4 - 3 - 5 ) 


(4.3.6) 


This is then usually added in quadrature to other errors, since we assume this effect is entirely 
uncorrelated to all other errors. This effect is why cosmological datasets usually have a lower 
redshift limit. We only want to look at SNe, which are safely in the Hubble flow, ie. where 
peculiar velocity effects are not dominant. The exact lower limit varies from analysis to analysis, 
but is usually of order 10~ 2 . 


4.3.2 Weak gravitational lensing 

Gravitational lensing, a subject in its own right [72], also affects SN measurements [73-76]. 
Light running through the universe is bent by the inhomogeneous large scale structure. This 
means that the flux we infer is also contaminated by the distorted image — eg. a demagnified 
SN will appear fainter, ie. further away. This effect is greater for far-away SNe, as is intuitively 
clear — the further away, the bigger the optical depth, ie. the more lenses to distort the image. 

For a precise determination of the lensing effect, one needs not only properties of the 
universe, but also of dark matter haloes — the profile of dark matter surrounding galaxy 
clusters. These are hard to determine, and so the exact numbers for the lensing uncertainties 
vary from work to work. An early study [77] found a linear relation between the noise and 
redshift, quoting an error of cq ens ~ 0 . 088 z, while a newer study [76], dedicated to the data of 
[78], finds cq ens rs 0 . 055 z. This is indeed the value used in [78], which we also take. This error is 
also added in quadrature, even though the actual lensing bias is not expected to be gaussian. 

Future surveys hope to be able to correlate large scale structure with the lensing bias — ie. 
to look at the line of sight through which the SN was found and try to determine separately the 
expected lensing, and as such make the once uncertainty a new signal. These possibilities will 
be explored by eg. the Dark Energy Survey . 8 


Thanks to Tamara Davis for insight on this point. 








PUTTING THE PIECES TOGETHER 



Now we combine the last three sections into an analysis of SN data. The ultimate goal of this 
analysis is of course to lay out the expansion history of the universe, potentially unravelling the 
mysteries of the cosmos. In less grandiose terms, we wish to constrain the parameters O m , (l .\ 
of our favourite, maximally spatially symmetric space-time, the FLRW model. 

The data I use is the Joint Lightcurve Analysis (JLA) catalogue [78] — a combination of data 
from Sloan Digital Sky Survey (SDSS-II), the SuperNova Legacy Survey (SNLS), SNe from the 
Hubble Space Telescope (HST) and some low redshift SNe from a selection of other surveys. 
See also [79] for description of selection criteria and outlier rejections. In this work, a lot of 
issues of combining SN surveys have been adressed, in particular calibration issues between 
telescopes and the empirical training of the SALT procedure. 

This section follows closely our recent work [1], only in more detail. 


5.1 THE DATASET 

All data I use, including covariances, is available through the website of the JLA collaboration. 1 
Here I wish to give a brief overview of how it looks and feels. Fig. 13 shows the distribution 
of the SNe on the sky in equatorial coordinates. The SDSS stripe is about 2 . 5 ° wide and 120 ° 
long, while the SNLS samples 4 regions of low galactic extinction with area 1 square degree 
each. This distribution on the sky makes the high redshift surveys particularly bad for dipole 
searches a la Sec. 3.5.4, since we have information only in very limited sections of the sky — 
any multipole expansion of the velocity field of the far away universe will be wildly unstable. 
The redshift coverage of the different surveys is shown in Fig. 14. Notice in particular how the 
SDSS has filled in a gap around redshift 0 . 2 . This gives some constraining power over the most 
naive implementation of void models, where the Hubble parameter would 'jump' between the 
datasets. 

Next, let's have a look at the correction parameters x\,c of the SALT fits. These are presented 
in Fig. 15. The superimposed gaussians are simply put there to guide the eye. As of writing 
this, I know of no theory of the distribution of these parameters. That is why we assume the 
theoreticians dream, that they come from gaussians. This is almost certainly not true, but 
will be our first step towards finding out more about these distributions. As we will see, we 
absolutely need to deal with these distributions. To see why this is the case, let's look at the 
distribution of the uncertainties of the measurements. The diagonal elements of the covariance 
matrix are presented in Fig. 16. For some of these measurements, not only does the intrinsic 
error and the experimental one have similar size — the experimental noise may completely 
dominate for the most poorly sampled lightcurves. For most of them though, this is not the 
case, and the measurement is quite good. The point still remains in principle though, and we 
lose nothing — except computing time — by doing a more thorough analysis. 

Naturally, one could consider more intricate distributions for these two parameters. This is 
simply a matter of introducing more and more parameters to describe them. However, without 
any physical motivation to do so, we simply put in gaussians. Even if they are not optimal, they 
capture the most prominent feature — they have some spread, which we wish to quantify. 


1 http://supernovae.in2p3.fr/sdss_snls_jla/ReadMe.html 
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Figure 13: Mollweide projection of the distribution on the sky of the four sets of SNe. The cyan 
line going across the sky is the galactic plane. It is immediately obvious that not many SNe are 
seen through the Galactic plane. Notice how the SDSS and SNLS surveys are constrained to 
very small regions of the sky, where the observers look over and over again. This helps them 
get nice early time coverage of the lightcurves. 



Figure 14: Distribution of SN redshift from different surveys. The right histogram has logarith¬ 
mic redshift axis. From low to high redshift (red, blue, green, purple, same colour coding as 
Fig 13) is low-z, SDSS, SNLS, HST. 

Count Count 




Figure 15: Distribution of the measured X\,c. A gaussian with matching mean and variance is 
superimposed. The approximate standard deviation of these distributions are cr x \ ~ 0.99 and 
a c ~ 0 . 084 . These numbers are guiding, and play absolutely no role in the fitting procedure. 
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Figure 16: Distributions in logarithmic bins, of the square root of the diagonal elements of the 
covariance matrix for the correction parameters x\,c. These errors are to be compared with 
the variance of the distribution in Fig. 15. It is immediately obvious that we cannot neglect 
these errors when constructing the likelihood function. Indeed this looks a bit like the example 
Unequal errors on measurements in Sec. 2.5 — we have some intrinsic distribution which we 
contaminate with experimental uncertainties, so the total error is the intrinsic error and the 
experimental error added in quadrature. Where the two errors would be approximately equal 
(taken as simply the numbers quoted below Fig. 15 divided by \/ 2 ) is marked by the blue line. 


5.2 A STATISTICAL MODEL OF CALIBRATED STANDARD CANDLES 
As stated in Sec. 4.2, the corrected SN distance modulus is taken to be 2 


Fsn = ™*b ~ (M - + M' ( 5 - 2 -i) 

Now we take it to be the true values that obey this relation. Writing down the likelihood for the 
dataset (mh, iy, t \,...) is then straight forward. With a hat on observed values, we have 

C = p[(ml,x lr c)\ 6 ] = J p[(m* B/ x 1 ,c)\(M,x 1 ,c), 6 ] p[(M,x 1 / c)\e]dMdx 1 dc (5- 2 -2) 

Here I have simply used Eq. (2.1.6) to integrate out my ignorance of the true values of the 
observables. I stress that p[(M,xi,c)\ 6 ] is my simple model of the distribution of the correction 
parameters — not a (Bayesian) prior. What we need now is just the two expressions for the pdfs 
in this equation. 

Inspired by Fig. 15 we will take the intrinsic distribution of all the parameters M, X\, c to be 
independent, 3 gaussian, and redshift independent, giving us the following pdf, 

p[(M,x lr c)\6\ = p(M\9)p(x l \e)p(c\e), where: ( 5 - 2 - 3 ) 

P(M | 0 ) = ( 27 Ttr^ o )" 1 / 2 exp|- [(M-M 0 )/cr Mo ] 2 / 2 j, 

P(xi\0) = (27n7 2 0 )" 1/2 exp j— [(x a - x 0 ) /a Xo } 2 /lj, 

p(e\0) = (27T(r 2 )- 1/2 exp|-[(c-Co)/c7- Co ] 2 /2}. 

All the 6 parameters {Mo, cr Mo , xq, cr Xg/ Co, cr Co } are fitted along with the cosmological parameters, 
since we have no theoretical model for what they should be. To simplify our calculations, we 

Note that we do not include the newer host galaxy mass correction employed in [78], as it is not of immediate relevance to 
the problem we are addressing. The change due to the exclusion of this parameter is negligible. 

This is easily extended to correlated distributions if one wants to do so — here we simply use the minimal reasonable 
amount of parameters 
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write this in terms of a vector Y — {Mi, x 11; c\,... Mm, xin, c n}/ the corresponding zero-points 
Yq, and the covariance matrix E; = diag(c^ , a^ 0 , cr? 0 ,...), 


1-1/2. 


p[(M,xi,c)| 0 ] = p[Y\Yq, 6 ] — | 27 tE;| - exp 


-(Y-Y 0 )E- 1 (y-y 0 ) T /2 


(5-2-4) 


Note that this gaussian approximation is just the simplest reasonable model for these data. 
Introducing skewness or other such distributions may obviously lead to higher likelihoods. The 
main point here is that we desperately need some model, and we have no theoretical motivation for 
any one over another — we merely pick the simplest one. 

Taking the experimental errors, statistical as well as systematic, to be described by gaussians 
as well, gives us the following pdf 


P (x\x,e) 


| 27 rE d | 1/2 exp 


— (X — X)EJ 1 (X — X ) t /2 


(5-2-5) 


where X = ...}, X is the observed X, and E d is the estimated experimental 

covariance matrix. To combine the two, we write 


X - X = (Z 2 T 1 - Y)A where (5.2.6) 

Z = {m * B1 -ji C 1 ,x n ,c lr ...}, 

/10 0 \ 

-a 1 0 0 

A = /3 0 1 

V 0 ■■■) 


where the 3-by-3 block repeats all the way down, and all other elements are zero. Hence 
p[X\X, 9 ] — p[Z\Y, 6 ] and we get for the likelihood 


c = J p[Z\Y,6] p[y|y o ,0]dY 

- | 27 TE d I- 1 / 2 | 27 rE z I- 1/2 y dY 
xexp(-(Y-Y 0 )E / ” 1 (Y-Y 0 ) T / 2 ) 

x exp (-(Y - ZA” 1 )AE ( 7 1 A t (Y - ZA” 1 ) T / 2 ^ 

- |27r(E d +A T E ; A)r 1/2 

x exp [-(Z- Y 0 A)(E d + A T E / A)" 1 (Z- Y 0 A ) T /2 . 


(5-2-7) 


(5.2.8) 


The gaussian form of the integrand makes this integral very simple. Remember that this 
likelihood reflects not just our calibration of the SNe, but also the modelling of the correction 
parameters. From this likelihood we will find the MLE and derive confidence limits on 
both cosmological quantities, O m , Da, but are also able to place constraints on the absolute 
magnitude scatter, the correction coefficients and the distributions of correction parameters. All 
these tell us about how good our model is and how well our candles are calibrated — how 
standard they really are. 


5.3 RESULTS OF THE MAIN FIT 

In this section I will present the main result of the work — the MLE and confidence regions of 
our fit to the latest, greatest catalogue of SNe to date. 4 Tab. 1 presents the best fits under specific 

4 The code and data used in the analysis is available for the interested at http://nbia.dk/astroparticle/SNMLE/. It 
uses Python 2.7 and the scientific library SciPy, both of which are open source. 
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Table 1: MLE under specific constraints, marked in boldface. Ax 2 here is short for 
- 2 log £/r max . (— 2 log £ max - - 214 . 97 ) 


Constraint 

AX 2 

O m 

n A 

Oi 

x 0 

Cro 

0 

Co 


M 0 

CMo 

None (best fit) 

0 

0.341 

0.569 

0.134 

0.038 

0.932 

3.059 

-0.016 

0.071 

-19.052 

0.108 

Flat geometry 

0.147 

0.376 

0.624 

0.135 

0.039 

0.932 

3.060 

-0.016 

0.071 

-19.055 

0.108 

Empty universe 

11.9 

0.000 

0.000 

0.1.33 

0.034 

0.932 

3.051 

-0.015 

0.071 

-19.014 

0.109 

N on-accelerating 

11.0 

0.068 

0.034 

0.132 

0.033 

0.931 

3.045 

-0.013 

0.071 

-19.006 

0.109 

Matter-less universe 

10.4 

0.000 

0.094 

0.134 

0.036 

0.932 

3.059 

-0.017 

0.071 

-19.032 

0.109 

Einstein-deSitter 

221.97 

1.000 

0.000 

0.123 

0.014 

0.927 

3.039 

0.009 

0.072 

-18.839 

0.125 


constraints. The parameters are described in the previous section. In particular we see that the 
calibration brings the intrinsic (or at least unaccounted-for) variation to 0.108 mag. We also get 
out the variances of the correction parameter distribution, which are relatively independent 
of the other parameters. Compared to the numbers quoted in Fig. 13 we see that the effect of 
experimental uncertainties is most pronounced in the colour correction, c, which might have 
been anticipated from Fig. 16. 

In the spirit of cosmology. Fig. 17 presents the 1 , 2 , and 3 a contours of the Q m , f] ; \ profile 
likelihood and the 1 and 2 a contours of the full likelihood, projected to the Cl m , fl,\-plane (see 
Fig. 5 for how the two differ). From this figure, and the tabulated numbers, we see in particular 



Figure 17: Contour plot of the profile likelihood in the Q m , fi A plane. 1, 2 and 3 a contours, 
regarding all other parameters as nuisance parameters, are shown as red dashed lines. Blue 
lines mark the 1 and 2 cr 10D contours projected on to the plane (see Fig. 5 — the 2D contours 
describe the confidence in only a u ,, while the 10D are the joint contour of a | and a w , but only 
shown in a ul space because of the dimensional limitations of paper). 
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that the non-accelerating universe is excluded at about 3 a — not overwhelming evidence. 5 The 
Hubble plot we find is presented in Figs. 18 and 19. 



Redshift 


Figure 18: Comparison of data and model. The measured distance magnitudes, /*sn — 
wig — Mo + ai'i — / 5 c with different markers depending on the survey. The expected value in 
two cosmological models are also plotted. ACDM is the best fit accelerating universe, and 
Milne is an empty universe expanding with constant velocity. The error bars are the square 
root of the diagonal elements of £/ + A T 'Ej/M 1 , and include both experimental uncertainties 
and intrinsic dispersion. These error bars are therefore mildly correlated. 


Now we want to assess how good the fit is. We simply put in gaussians as our model before, 
now we want to see how well this actually describes the data, along with a determination of 
how well the statistical approximations we do are — this is not a linear model, yet we use Wilks' 
theorem to find confidence regions. In Fig. 20 are the pulls, defined as the residual, normalised 
to the combined expected error, 

pull - (Z-Y 0 A)U~\ (5.3.1) 

where U is the upper triangular Cholesky factor of the estimated covariance matrix, ie. U T U — 
Ej + A t E/A. We see from the figure that while it is not a perfect description, neither is it 
obviously invalid. Performing a series of goodness-of-fit tests of the pull distribution to a normal 
distribution, we get the p-values in Tab. 2. All four tests here are looking at the cumulative 
distribution function in different ways. Looking at more targeted tests may give radically 
different answers. In particular the skewness (the third moment) is off, which might have been 
anticipated already from the distribution of X\. 

We also perform a simple MC test to check that the confidence levels we set by Wilks' 
theorem are good. To do so, we simulate 10 4 datasets, assuming the model to be correct and 
taking the best fit parameters of the actual data as the model. 6 We keep the z-values, but draw 
new M, x\, c values from the predicted distributions. For every one of these datasets we find the 
best fit parameters, and in particular the maximum likelihood. Wilks' theorem now states that 
the distribution of the quantity —21 og£true/ £max is a y 2 with 10 degrees of freedom, where 
£true is the likelihood of the true parameters. The distribution from MC and the analytic curve 
are plotted in Fig. 21. We see that Wilks' theorem holds true to very high precision, and so we 
can indeed trust the confidence levels set by the likelihood ratio. 

The usual criterion in particle physics is 5a, or a local p-value of about 5.7 ■ 10~ 7 , for a rigorous discovery, however see 
[ 80 , 81 ] for a discussion of this convention. 

This choice is the most relevant, but is not important — we could in principle choose any value. 
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Figure 19: As Fig. 18, but with the Milne model subtracted. The Milne model plotted has had 
its Hubble constant corrected by changing the zero point slightly to correct for the change in 
M 0 — see the discussion following Eq. (4.3.1). 
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Figure 20: Pull distribution of the best fit model. Pulls are defined as described in Eq. (5.3.1). 
According to our very simple model, everything should be gaussian. Therefore we superimpose 
a normalised gaussian with the expected Poisson noise (1 / VN). 
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Table 2: p-values from testing the hypothesis that the pull distribution follows a normal 
distribution, A/"( 0 , 1 ). 


Test 

statistic 

p-value 

Anderson-Darling 

2.528 

0.0479 

Cramer-von Mises 

0.454 

0.0522 

Kolmogorov-Smirnov 

0.0244 

0.1389 

Kuiper 

0.0329 

0.1231 


PDF 



Figure 21: Distribution of the MC likelihood ratio as defined in the text. The expected y 2 
distribution is superimposed. The excellent agreement between the two reinforces our trust in 
Wilks' theorem. 


5.4 OLDER ANALYSES 

To emphasise the strength of the present analysis, it is instructive to look at the exact differences 
between this and previous analyses. The previous analyses are mainly in two categories: a 
likelihood based one, which I argue is not a good fit, and a residual based one, which in 
particular is not a likelihood maximisation, and as such, what we learned in Chap. 2 no longer 
applies. Let us first take a look at these other methods and then return to a brief comparison 
with the proposed approach. 


5.4.1 Residual based method 


This method is the most prominent method used in analyses using the SALT method (or other 
methods like it) of lightcurve fitting, see eg. [2, 78, 79, 82-84]. The exact procedure varies 
slightly, but the main point is that one considers the quantity, which is sometimes called a y 2 , 
but for pedagogical purposes I will simply call it /, 


/=(m 

~E 


g — M + olx\ 
Afi 2 

(T 2 4 - 


fie - p) [diag(c 4 f ) + C(a,/ 3 )] 1 (m* B - M + ax l - pc-p) 


( 5 - 4 -i) 


where I have written out explicitly the term cr inf in the covariance matrix C, which I write 
schematically as u 2 ; = Yjab^d,i)ab r n r br where r — (1, ci, —ft) mixes in the relevant covariances of 
X\, c, and Z,/ is the covariance matrix as used in the previous section. The newest version of this 
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type of analysis tries to determine independently the cr lnt ([78]), while most other analyses have 
done something curious. First the / is minimised with some plausible value of cr mt inserted — 
typically a guess based on previous analyses. When the minimum is found, one then adjusts the 
cr int such that f is equal to the number of degrees of freedom, say N. As we saw in Sec. 2.4.2 this is 
indeed reminiscent of fitting the unknown intrinsic error (the degree to which SNe are standard 
candles). But, as I stress in Sec. 2.5, not even the maximum likelihood estimate satisfies this 
exactly, when the errors on the datapoints are unequal. When the / has now been fixed, the 
minimum might have moved slightly, so the procedure is iterated until convergence. 

The most alarming thing is now, that confidence regions are put in place by Wilks' theorem 
— even though as we just saw, this method is not derived from a likelihood, and as such using Wilks' 
theorem is manifestly nonsense. However, as we will see, since the guess / is an educated one, the 
limits one sets this way are not completely off. 


5.4.2 Simple likelihood 


It has been realised, that the previous method was indeed not derived from a likelihood, 
see [85-88]. If / is to be interpreted as something like — 21 og£, then we have to impose a 
normalisation, namely f £ d( data) = 1 . Taking just the inf as the data, we see that the previous 
expression (5.4.1) should come from a likelihood, which takes the form 

C w = Yl^Tzlrf + of nt ])~ 1/2 exp (-l 2 A + 2 ) ( 5 - 4 - 2 ) 

\ a }t~n a int J 

Now we can perform a rigorous likelihood maximisation, and construct confidence regions 
using Wilks' theorem. But let's first look closer at the likelihood we have constructed. As just 
stated, the way I constructed this was by regarding the integral as going over only the mf. But 
there is no reason why mj should enjoy a privileged status compared to X\, c. It is immediate 
that trying to integrate over these datapoints will give infinity. So let's see where we went 
wrong. To do so, we need to go back to the basics of constructing the likelihood, Eq. (5.2.2). 
Now put in flat distributions of X\,c, ie. instead of all gaussians, we now take 

P(x i|0) o< 1 

p(c\ 9 ) « 1 (5.4.3) 


where in principle we need some compact support for these distributions to have a finite (unit) 
normalisation of the likelihood. Putting these distributions into Eq. (5.2.2), with the gaussian 
measurement errors and p(M\ 0 ), we have, using the notation of the last section 


£ 7/7 ex 


^\ 2 nZ d \ 

X ( 2 ^Mn)" 1/2 


exp 


--(X-X)£ rf - 1 (X-X) 


dx\ dc 


exp 


{- [(M-M 0 )/n Mo ] 2 / 2 } dM 


( 5 - 4 - 4 ) 


Performing first the X\, c integrals simply mixes in the uncertainties and swaps X\, c for X\,c in 
X. We end with, writing for simplicity just a diagonal covariance matrix. 


1 [™*Bi ~ ( f ( Z 0 + M ~ XX\j + fiCj )} 2 

2 YLabi^i) ab^n^b 

x (2 n°htT %n ex P { - K M - Mo) /(Tint } 2 /2 j dM (5.4.5) 

The error Yjab{'^- j i)ab r a r b here is what we schematically called before. Performing the M 
integral now simply adds the c rnt error and swaps M for Mo in the residual, which gives us 
the expected expression, given in Eq. (5.4.2) — up to a constant, which in principle is infinite. 
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exp 
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PUTTING THE PIECES TOGETHER 


Putting in compact support of the X\,c distributions complicates the integrals, but provided the 
limits are far away from the interesting region, the results will be almost the same. I won't go 
into details about fitting these limits. The main point is that now we know exactly how this 
likelihood comes about, and so can say confidently that it is wrong, as we see quite clearly from 
Fig. 15, that we have no reason to believe the distributions of X\,c are flat in the full range of 
the variables. One can of course impose narrow uniform distributions, but this will still include 
two extra parameters in the fit, just as the gaussian distributions do — there is no such thing as 
a free distribution. 


5.4.3 Comparison with the present analysis 


We have already seen how the naive likelihood based method compares to the present analysis. 
For the present purposes, let us rewrite the likelihood of Eq. (5.2.8) as 


£= |27r(S d + A T S / A)| 1/2 

x exp [-(ZA- 1 - Y 0 ){A T - 1 Z d A ~ 1 +Z l )- 1 (ZA - 1 - Y 0 ) x /2 


(5.4.6) 


Taking for simplicity Xj to be diagonal, we see that the combination of every third term in the 
expansion of the sum looks just like Eq. (5.4.1), notice the form of the residual here is 

ZA -1 - Y 0 = (mg - M 0 + olxi - ,6c - p, - x w , c - c 0 ,...) ( 54 - 7 ) 


This means we are more or less fitting the same expressions we were before. The important 
thing, which has been left out of the other analyses are the terms linking residuals in 5 t\, c to 
those of in } — the off-diagonal terms of A T 1 Xj A ~ 1 + X,. Since these analyses did not consider 
residuals of these terms, they obviously cannot include these corrections. 

Now, that is not to say that the residual based method will give a wrong result. We just 
need to show it by some other means than for the MLE. In particular, since we cannot do 
the calculation analytically, we will have to show it by simulations. To do such a simulation 
though, we have to assume a distribution from which we draw the values of X\ and c. To show 
how well the two methods should agree, I reuse the MC from Sec. 5.3 and now do both a fit 
with our MLE method and the residual based method. As a slight correction to the method, 
I use for the cq nt term in the residual method the value I find from the MLE. Now, for every 
simulated dataset, I plot the difference in the obtained parameters in Fig. 22. This MC study is 
then compared to the obtained values of the actual dataset. Fitting the original dataset by the 
residual method, our best fit is 


{ 0 m , 0 A ,«,/3 ,M 0 } = { 0 . 200 , 0 . 591 , 0 . 134 , 3 . 08 ,- 19 . 07 } (5.4.8) 

First of all, notice that the two methods actually agree on average! This is somewhat surprising, 
but certainly possible. This just means that within this model the two methods agree, more 
or less — to the degree of spread in the figure. However, looking at the value of the real 
dataset, we see that it doesn't quite agree. To put this into numbers, I first calculate the sample 
covariance of the MC values, X. Seeing the distribution of MC points as an estimate of the pdf, 
X is the covariance of the 5d approximate gaussian distribution. We can now calculate a y 2 of 
the difference we see in the real data, as 

Ay 2 = Aparameters • X“ 1 • Aparameters % 22 . 73 , (54-9) 

which for a y 2 distribution with 5 degrees of freedom is about 3 . 6 cr. That means, taking into 
consideration both fits, it is rather unlikely that they would differ by this much, if the gaussian 
model is correct. This fit is explicitly carried out with gaussian distributions — even for the 
residual method, which superficially completely disregards this information. 



5-4 OLDER ANALYSES 
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Figure 22: Correlations between MLE and residual method parameters, for the relevant pa¬ 
rameters. Plotted is MLE—residual estimate. The dashed ellipses show approximate 1 , 2 , 3 cr 2d 
profile contours. The blue markers show the value obtained from the real data. We see that in 
particular the fi m , n,\ point is off. In total, there are 9945 simulated datasets plotted. 
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PUTTING THE PIECES TOGETHER 


It is important to realise that any result obtained by the residual based method can only be 
validated by a MC study. This does by no means forbid it be a good estimator. What is dangerous 
though, is that one cannot extract from this the parameters of the underlying distributions, and 
so we lose the ability to eg. plot Fig. 20 in this framework. We lose the ability to assess the 
model of correction parameter which I emphasise, we must assume to validate the method, whether 
we like it or not. In particular, doing the MLE method, we have made only completely standard 
assumptions. 



CONCLUSION 



In this thesis I have presented in detail my knowledge of fitting cosmological parameters 
with supernovae. I analysed the latest large compilation of supernovae and found a result in 
significant contrast to the canonical result. In particular, a non-accelerating universe is only 
excluded at 3 cr by supernovae alone. This is not to say that supernovae prefer this universe — 
the best fit point of this very simple model is still with a significant dark energy contribution! What 
we have done is to state explicitly all assumptions that are usually made. These assumptions 
will surely change, even in the near future. Preliminary work such as [89, 90] suggests using 
more and more contrived versions of the residuals based method due to selection effects, and 
a very recent study proposes two populations of supernovae [91]. I have in this analysis not 
made any such assumptions or speculations. The result I present has standard assumptions 
and models — only we need to state this explicitly, because we have to write down a likelihood. 

This effort, I think, has two obvious products. First of all, we now have well defined 
confidence regions, readily compared to other analyses. We know at all times exactly what we 
are fitting and modelling, because we are forced to write all this down. Secondly, it gives an 
avenue to further exploration of the correction parameter distributions. As stated before, it is 
hardly believable that the distributions should be exactly gaussian, or that there is no evolution 
of supernovae during the evolution of the universe. The method presented here is very easily 
capable of dealing with this issue. 
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